Add cuda12 variant of tensorflow-notebook #2100

Merged (13 commits) on Mar 26, 2024

Conversation

ChristofKaufmann
Contributor

Describe your changes

  • This adds a cuda12 variant of the tensorflow-notebook, analogous to what eccda24 did for the pytorch-notebook.
  • The CPU version uses the tensorflow-cpu wheel now (to reduce size of the image).
  • Regarding cuda11 variant: The current version of TensorFlow is 2.16.1 and it seems the last compatible PyPI wheel with CUDA 11.8 is TensorFlow 2.14.1 (according to the officially tested versions). I still tried TensorFlow 2.16.1 with CUDA 11.8.0, but it didn't work. The current version of tensorflow-gpu on conda-forge is 2.15.0 and has a CUDA 11.8 build. So if you want a cuda11 variant, I can try to use the conda-forge version for that, but the TensorFlow version is not up-to-date.

Issue ticket if applicable

Fix: #2095, #1557.

Checklist (especially for first-time contributors)

  • I have performed a self-review of my code
  • If it is a core feature, I have added thorough tests
  • I will try not to use force-push to make the review process easier for reviewers
  • I have updated the documentation for significant changes

@mathbunnyru
Member

Could you please fix tests?

docs/using/selecting.md (outdated)
images/tensorflow-notebook/cuda12/Dockerfile (outdated)
SHELL ["/bin/bash", "-o", "pipefail", "-c"]

# Install CUDA libs and cuDNN with mamba
RUN mamba install --yes -c nvidia/label/cuda-12.3.2 \
Member

Is there a chance not to hardcode minor.patch version here?
Also, 12.4 was released.

Contributor Author

I used 12.3 because it is listed in the tested build configurations. There seems to be no label like cuda-12. Without a label, i.e. with -c nvidia, I got the latest CUDA version, which is currently 12.4. It worked, but I thought it was risky. A new CUDA release might be incompatible (I guess not before 13.x), and it seems we do not have a unit test that, e.g., checks the output of python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))". Should we add a unit test like this?
Secondly, the pytorch-notebook fixes the minor version as well.
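
A minimal sketch of the kind of check proposed above, assuming it would run inside the built image. The helper name `visible_gpus` is hypothetical; `tf.config.list_physical_devices` is the call quoted in the comment. The import guard lets the script finish cleanly where TensorFlow is not installed, and an empty list is what one would expect on a runner without a GPU.

```python
import importlib.util


def visible_gpus():
    """Return TensorFlow's list of visible GPU devices, or None without TensorFlow."""
    # Skip gracefully where TensorFlow is not installed (e.g. a plain CI runner).
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf

    return tf.config.list_physical_devices("GPU")


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not installed; nothing to check")
    elif not gpus:
        print("TensorFlow imported, but no GPU is visible")
    else:
        print(f"{len(gpus)} GPU(s) visible: {gpus}")
```

A real test for the image would probably assert on a non-empty list, which is only meaningful on a host that actually exposes a GPU to the container.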

Member

I used 12.3 because it is listed in the tested build configurations.

There is a chance that the table might be slightly outdated.

There seems to be no label like cuda-12. Without a label, i.e. with -c nvidia, I got the latest CUDA version, which is currently 12.4.

Can we use something like nvidia/label/cuda-12.*?

Should we add a unit test like this?

I am not sure if this test will work with a regular GitHub-hosted ubuntu runner.
Also, we currently don't have a way to run a test for variant image (but it's not difficult to add something like this).

Contributor Author

There is a chance that the table might be slightly outdated.

I looked at the table just a few hours after the release and it was up to date. But it won't ever be tested against a newer version.

Can we use something like nvidia/label/cuda-12.*?

No, there is no such label. Here is a list.

I am not sure if this test will work with a regular GitHub-hosted ubuntu runner.

Right, that is always problematic. Sorry.

Member

Can we use something like nvidia/label/cuda-12.*?

No, there is no such label. Here is a list.

I meant maybe mamba supports label regex (I have no idea if it does or not).
In that case we will be fine with existing tags and won’t need to hardcode some particular version.

Contributor Author

Although labels exist, I noticed that all package versions are also available under the main label. The versioning in the list of labels is a bit strange: labels are formatted as major.minor.patch, whereas package versions are formatted as major.minor.build. Usually, when the patch number increments, the build number just continues:

  • label: cuda-11.6.0, version: 11.6.55
  • label: cuda-11.6.1, version: 11.6.112
  • label: cuda-11.6.2, version: 11.6.124

There is one exception: labels cuda-11.4.3 and cuda-11.4.4 both have version 11.4.152.
Nevertheless, I just tried:

  • mamba install -c nvidia 'cuda-nvcc<13' and got version 12.4.99 (which is also in label cuda-12.4.0)
  • mamba install -c nvidia 'cuda-nvcc<12' and got version 11.8.89 (which is also in label cuda-11.8.0)
  • mamba install -c nvidia 'cuda-nvcc<11.5' and got version 11.4.152 (which is also in labels cuda-11.4.3 and cuda-11.4.4)
  • mamba install -c nvidia 'cuda-nvcc=12.3' and got version 12.3.107 (which is also in label cuda-12.3.2)

So we could use 'cuda-nvcc<13' to reduce maintenance work (no need to update the version for every TensorFlow release), but the resulting versions are not officially tested by TensorFlow (I am not sure whether new minor versions can introduce incompatibilities). Using something like 'cuda-nvcc=12.3' is more work (though it still avoids pinning the patch version), but it is officially tested.
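
The four results above can be reproduced with a small sketch. It is a simplified model of version matching, not how mamba's solver works internally: the function names `parse` and `resolve` are hypothetical, only the `<X` and `=X` forms are handled, and the version list is just the subset of cuda-nvcc versions quoted in this comment.

```python
# Versions of cuda-nvcc quoted in the comment above (a subset of the channel).
AVAILABLE = ["11.4.152", "11.6.55", "11.6.112", "11.6.124", "11.8.89", "12.3.107", "12.4.99"]


def parse(version):
    """Turn '12.3.107' into (12, 3, 107) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def resolve(spec, available=AVAILABLE):
    """Pick the newest version matching a simplified '<X' or '=X' spec."""
    op, bound = spec[0], parse(spec[1:])
    if op == "<":
        matches = [v for v in available if parse(v) < bound]
    elif op == "=":
        # '=12.3' is treated as a prefix match, i.e. any 12.3.* build.
        matches = [v for v in available if parse(v)[: len(bound)] == bound]
    else:
        raise ValueError(f"unsupported spec: {spec}")
    return max(matches, key=parse) if matches else None


for spec in ("<13", "<12", "<11.5", "=12.3"):
    print(f"cuda-nvcc{spec} -> {resolve(spec)}")
```

The sketch makes the trade-off visible: an upper bound such as `<13` silently moves to whatever is newest, while `=12.3` stays within one minor series.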

Member

I'm OK with cuda-nvcc=12.3, but please add a comment explaining why we chose this version.

Contributor Author

This did not work, since the dependencies do not constrain their versions. So using cuda-nvcc=12.3 with the nvidia channel resulted in a mixture of 12.3 and 12.4 packages. NVIDIA is quite sloppy in their packaging.

Then I noticed that cudnn from the nvidia channel is outdated. Apparently they dropped support 3 years ago. The cudnn from conda-forge is quite up to date, but there is no CUDA 12 build yet, only CUDA 11.8.

So I would like to go with the new installation method supported by TensorFlow, which is basically just pip install tensorflow[and-cuda]. This also has the advantage that the installed CUDA version is always the officially tested version, so less maintenance for you. Usually the path to the nvidia libs should be found automatically, but there seems to be a bug in 2.16.1, so we have to add them ourselves. I prepared an activation script for that. Also, LD_LIBRARY_PATH is not polluted with this method, because the paths from the pip installation contain only the nvidia libs. Previously we added ${CONDA_DIR}/lib/, which contains quite a lot of libraries.
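
To illustrate which paths are meant, here is a rough sketch, not the activation script from this PR (which is not shown in the conversation). It assumes the pip-installed NVIDIA wheels place their shared libraries under `site-packages/nvidia/<component>/lib`, as the paths in the import log further down suggest. The function names are hypothetical.

```python
import tempfile
from pathlib import Path


def nvidia_lib_dirs(site_packages):
    """List the nvidia/<component>/lib directories under a site-packages folder."""
    root = Path(site_packages) / "nvidia"
    return sorted(str(p) for p in root.glob("*/lib") if p.is_dir())


def extend_ld_library_path(current, lib_dirs):
    """Append lib_dirs to an existing LD_LIBRARY_PATH value, skipping duplicates."""
    parts = [p for p in current.split(":") if p]
    for d in lib_dirs:
        if d not in parts:
            parts.append(d)
    return ":".join(parts)


if __name__ == "__main__":
    # Demonstrate on a throwaway directory tree instead of a real installation.
    with tempfile.TemporaryDirectory() as tmp:
        for component in ("cublas", "cudnn"):
            (Path(tmp) / "nvidia" / component / "lib").mkdir(parents=True)
        print(extend_ld_library_path("", nvidia_lib_dirs(tmp)))
```

Because only the `nvidia/*/lib` directories are added, nothing else from the environment ends up on the loader path, which is the point made about not polluting LD_LIBRARY_PATH.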

Member

Ok, let’s try it

images/tensorflow-notebook/cuda12/Dockerfile (outdated)
NVIDIA_DRIVER_CAPABILITIES="compute,utility"

# Install Tensorflow with pip
RUN pip install --no-cache-dir tensorflow && \
Member

Maybe we can use tensorflow from mamba?
I think in such a case we won't even need to list the dependencies.

Collaborator

@consideRatio consideRatio Mar 11, 2024

  • The current version of tensorflow-gpu on conda-forge is 2.15.0 and has a CUDA 11.8 build. So if you want a cuda11 variant, I can try to use the conda-forge version for that, but the TensorFlow version is not up-to-date.

This is the kind of complexity I recall from getting a tensorflow gpu image working: the conda-forge version was often outdated, and installing with pip for gpu support was complicated.

I think if we have to choose between install complexity and relying on something that is regularly outdated, the install complexity may be preferred; otherwise we introduce things we can't control. At the same time, the fact that it's outdated reflects how complicated it may be to keep installing something that works over time, which is a burden we may then be taking on.

This PR will probably demonstrate the current maturity of the tensorflow gpu stack upstream. If it's as bad as I experienced a while back, then I think it's better not to try to maintain a tensorflow gpu image, to avoid making this project too hard to maintain as a whole.

Contributor Author

Yes, it was very complicated and the conda-forge tensorflow-gpu package helped by providing cudatoolkit and cudnn within the same toolchain. But I just tried to install tensorflow-gpu from conda-forge into scipy-notebook and it fails due to conflicts. So maybe it changed and nowadays the installation of tensorflow from PyPI and cuda and cudnn from nvidia's conda-channel is the easiest way. For maintenance I imagine using the cuda version from the tested builds for a new TensorFlow release should work.

Member

In my opinion, we can't tell if the current maturity of tensorflow packages is better until we merge this PR, have weekly builds, and have several releases of cuda/cudnn and the tensorflow package itself.

So, I am OK with how the Dockerfile currently looks, but we need to see if it's going to be OK after a few releases.

Member

I am ready to give it a try as a maintainer (and I can always disable the build just by changing a few lines in the docker.yml config-like file).

tests/docker-stacks-foundation/test_packages.py (outdated)
@mathbunnyru
Member

@mathbunnyru, @consideRatio, @yuvipanda, and @manics, please vote 👍 to accept this change and 👎 not to accept it (use a reaction to this message)
The voting deadline is the 11th of April (a month since I posted this message).
The change is accepted, if there are at least 2 positive votes.

We can have a discussion until the deadline, so please express your opinions.

As this is very similar to the pytorch-notebook, I won't wait until the deadline, if there are 2 positive votes before it.

@consideRatio
Collaborator

I voted 👀 for now. I'd like to see that this is reasonable to install and maintain long term, i.e. that the tensorflow gpu packages upstream make it easy enough. It has been a mess historically in my experience, and I don't want this project to take on maintaining this functionality if it's too messy. If the implementation looks not-messy, I'd be 👍. If it is messy, I think this may be a very notable commitment, and we had better protect the project's limited maintenance capacity from taking on such a maintenance burden.

@mathbunnyru
Member

I cleaned up the aarch64 machines, so builds should work better. Unfortunately, docker is the worst at cleaning its cache.

@manics
Contributor

manics commented Mar 12, 2024

I don't feel qualified to give a 👍 or 👎. @consideRatio has already highlighted the main issues around long-term maintainability, so I think the decision should be made by those who have ultimate responsibility for maintaining it.

@ChristofKaufmann
Contributor Author

I spent some time finding the best approach regarding maintainability. Now it looks quite similar to the PyTorch cuda variant, except that for TensorFlow:

@mathbunnyru
Member

You can't choose between CUDA 11 and 12 – it will use the officially tested build configuration.

Why can't you pin the version when using pip?

@ChristofKaufmann
Contributor Author

The extra "and-cuda" is defined here. There the package versions of the dependencies are fixed to the ones listed in tested build configuration. There is no additional "and-cuda11" extra. That's what I meant.
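
As a hedged illustration of what "defined as an extra" means in package metadata: requirements that belong to an extra carry an `extra == "..."` marker in their `Requires-Dist` entries. The sketch below filters such entries. The function names are hypothetical, and the sample entries use `X.Y.Z` as a placeholder rather than the real pinned versions.

```python
import re
from importlib import metadata


def requirements_for_extra(requires_dist, extra):
    """Keep the Requires-Dist entries that are guarded by the given extra."""
    marker = re.compile(r"extra\s*==\s*['\"]" + re.escape(extra) + r"['\"]")
    selected = []
    for entry in requires_dist:
        requirement, _, condition = entry.partition(";")
        if marker.search(condition):
            selected.append(requirement.strip())
    return selected


def installed_extra_requirements(package, extra):
    """Look up an installed package's metadata; return [] if it is not installed."""
    try:
        requires = metadata.requires(package) or []
    except metadata.PackageNotFoundError:
        return []
    return requirements_for_extra(requires, extra)


# Placeholder entries in the Requires-Dist format (versions are not the real pins).
SAMPLE = [
    "numpy>=X.Y.Z",
    'nvidia-cublas-cu12==X.Y.Z; extra == "and-cuda"',
    'nvidia-cudnn-cu12==X.Y.Z; extra == "and-cuda"',
]
print(requirements_for_extra(SAMPLE, "and-cuda"))
```

In an environment where tensorflow is installed, `installed_extra_requirements("tensorflow", "and-cuda")` would list the pinned CUDA packages; asking for an extra that the package does not define, such as "and-cuda11", yields an empty list.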

@mathbunnyru
Member

The extra "and-cuda" is defined here. There the package versions of the dependencies are fixed to the ones listed in tested build configuration. There is no additional "and-cuda11" extra. That's what I meant.

I guess there are many libraries installed alongside tensorflow.
They probably have cu12 either in their name or in their version.
We can limit the version of one such library, so that pip has to choose the proper cuda version.

At least I think it is worth trying.

@ChristofKaufmann
Contributor Author

I tried to use the -cu11 packages (except for nvidia-nvjitlink-cu12, since it is new in CUDA 12) using:
CUDA11_DEPS=$(wget -qO- https://pypi.org/pypi/tensorflow/json | grep -o -e '[a-z_-]\+-cu12' | sed 's/-cu12/-cu11/; s/nvidia-nvjitlink-cu11//' | xargs)

However, it does not work. The tensorflow package is linked against CUDA 12 libraries. It expects e.g. libcudart.so.12, while only libcudart.so.11 is present.
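
For readers who find the shell pipeline hard to follow, here is a rough Python equivalent of its text processing, run on a small sample instead of the live PyPI response. The function name `cuda11_deps` and the sample text are assumptions for illustration, `X.Y.Z` stands in for the real pinned versions, and the de-duplication step is a small addition that the shell version does not have.

```python
import re


def cuda11_deps(metadata_text):
    """Mimic the shell pipeline: find *-cu12 names, rename to -cu11, drop nvjitlink."""
    # Same pattern as the grep call: lowercase package names ending in -cu12.
    names = re.findall(r"[a-z_-]+-cu12", metadata_text)
    renamed = [n[: -len("cu12")] + "cu11" for n in names]
    kept = []
    for name in renamed:
        # nvjitlink has no CUDA 11 counterpart, so it is removed as in the sed call.
        if name != "nvidia-nvjitlink-cu11" and name not in kept:
            kept.append(name)
    return " ".join(kept)


# Placeholder sample in the Requires-Dist format (versions are not the real pins).
SAMPLE = (
    'nvidia-cublas-cu12==X.Y.Z; extra == "and-cuda"\n'
    'nvidia-cudnn-cu12==X.Y.Z; extra == "and-cuda"\n'
    'nvidia-nvjitlink-cu12==X.Y.Z; extra == "and-cuda"\n'
)
print(cuda11_deps(SAMPLE))
```

Note that the version pins are dropped along with the rest of each entry, so pip is free to pick any available -cu11 release of each package.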

Full import errors with TF_CPP_MAX_VLOG_LEVEL=3
2024-03-15 02:38:23.246671: I external/local_tsl/tsl/platform/default/dso_loader.cc:70] Could not load dynamic library 'libcudart.so.12'; dlerror: libcudart.so.12: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/conda/lib/python3.11/site-packages/nvidia/cublas/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_cupti/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_nvcc/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_nvrtc/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_runtime/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cudnn/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cufft/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/curand/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cusolver/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cusparse/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/nccl/lib/
2024-03-15 02:38:23.246811: I external/local_tsl/tsl/platform/default/dso_loader.cc:70] Could not load dynamic library 'libcublas.so.12'; dlerror: libcublas.so.12: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/conda/lib/python3.11/site-packages/nvidia/cublas/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_cupti/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_nvcc/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_nvrtc/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_runtime/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cudnn/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cufft/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/curand/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cusolver/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cusparse/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/nccl/lib/
2024-03-15 02:38:23.246910: I external/local_tsl/tsl/platform/default/dso_loader.cc:70] Could not load dynamic library 'libcublasLt.so.12'; dlerror: libcublasLt.so.12: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/conda/lib/python3.11/site-packages/nvidia/cublas/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_cupti/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_nvcc/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_nvrtc/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_runtime/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cudnn/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cufft/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/curand/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cusolver/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cusparse/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/nccl/lib/
2024-03-15 02:38:23.247014: I external/local_tsl/tsl/platform/default/dso_loader.cc:70] Could not load dynamic library 'libcufft.so.11'; dlerror: libcufft.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/conda/lib/python3.11/site-packages/nvidia/cublas/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_cupti/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_nvcc/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_nvrtc/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_runtime/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cudnn/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cufft/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/curand/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cusolver/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cusparse/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/nccl/lib/
2024-03-15 02:38:23.261102: I external/local_tsl/tsl/platform/default/dso_loader.cc:59] Successfully opened dynamic library libcusolver.so.11
2024-03-15 02:38:23.261261: I external/local_tsl/tsl/platform/default/dso_loader.cc:70] Could not load dynamic library 'libcusparse.so.12'; dlerror: libcusparse.so.12: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/conda/lib/python3.11/site-packages/nvidia/cublas/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_cupti/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_nvcc/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_nvrtc/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cuda_runtime/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cudnn/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cufft/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/curand/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cusolver/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/cusparse/lib/:/opt/conda/lib/python3.11/site-packages/nvidia/nccl/lib/
2024-03-15 02:38:23.261470: I external/local_tsl/tsl/platform/default/dso_loader.cc:59] Successfully opened dynamic library libcudnn.so.8
2024-03-15 02:38:23.261483: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2251] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[]

If you look at the conda-forge tensorflow-gpu package files, there are e.g. cuda120 (12.0) and cuda118 (11.8) builds. So the library version is fixed at build time, and the PyPI tensorflow package provides only a CUDA 12 build.
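
The log above is easier to read once it is reduced to library names. The following sketch does that with two regular expressions. The log lines are abbreviated from the output above (timestamps, source locations, and the LD_LIBRARY_PATH suffix are omitted), and the function name `summarize` is hypothetical.

```python
import re

# Abbreviated from the log above; the dlerror details of each line are omitted.
LOG = """\
Could not load dynamic library 'libcudart.so.12'; dlerror: ...
Could not load dynamic library 'libcublas.so.12'; dlerror: ...
Could not load dynamic library 'libcublasLt.so.12'; dlerror: ...
Could not load dynamic library 'libcufft.so.11'; dlerror: ...
Successfully opened dynamic library libcusolver.so.11
Could not load dynamic library 'libcusparse.so.12'; dlerror: ...
Successfully opened dynamic library libcudnn.so.8
"""


def summarize(log):
    """Return (missing, loaded) library names found in a TensorFlow loader log."""
    missing = re.findall(r"Could not load dynamic library '([^']+)'", log)
    loaded = re.findall(r"Successfully opened dynamic library (\S+)", log)
    return missing, loaded


missing, loaded = summarize(LOG)
print("missing:", ", ".join(missing))
print("loaded: ", ", ".join(loaded))
```

The summary shows that the wheel asks for fixed library file names, so swapping in the -cu11 packages leaves most of them unresolved.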

@mathbunnyru
Member

I tried to use the -cu11 packages (except for nvidia-nvjitlink-cu12, since it is new in CUDA 12) using:

CUDA11_DEPS=$(wget -qO- https://pypi.org/pypi/tensorflow/json | grep -o -e '[a-z_-]\+-cu12' | sed 's/-cu12/-cu11/; s/nvidia-nvjitlink-cu11//' | xargs)

However, it does not work. The tensorflow package is linked against CUDA 12 libraries. It expects e. g. libcudart.so.12, while there is a libcudart.so.11.

If you look at the conda-forge tensorflow-gpu package files, there are e. g. cuda120 (12.0) and cuda118 (11.8) builds. So the library version is fixed at build time and the PyPI tensorflow package provides only a CUDA 12 build.

Thanks. I guess in that case let's rename the variant to simply cuda, because we don't have any control over the cuda version. And please update the docs to mention it.

@mathbunnyru
Member

#2100 (comment)

@yuvipanda what do you think about this PR?

@twalcari
Contributor

While I have no official vote here, I would like to express my full support for this PR. Given that TensorFlow is mainly used for GPU-accelerated applications, it makes a lot of sense to have a GPU-capable docker image available.

The current non-GPU images feel like a neutered alternative: they are OK for some preliminary exploration of how TensorFlow works, but are ineffective for any real-world application.

@yuvipanda
Contributor

Sorry for the delay, @mathbunnyru.

I'm +1 on this change because it's using the upstream supported way to install tensorflow - the and-cuda variant (as described in https://www.tensorflow.org/install/pip#linux).

The only (non-blocking) concern I have is that it's based on the scipy-notebook image, which installs packages primarily from conda-forge. And tensorflow has some dependencies (particularly numpy) that are already in the base image. So the question is, what happens if a newer (or older) version of numpy is required by tensorflow than what we get from conda-forge? Would mixing pip and conda like this cause issues? In my experience, it mostly does not (I literally did this with tensorflow in another project a few months ago). And I'd rather us do this if it means we can directly use the method maintained by upstream. It's also what we do for pytorch now.

So overall, +1 from me. Thank you for this contribution, @ChristofKaufmann! And thanks for your stewardship, @mathbunnyru

@mathbunnyru
Member

So the question is, what happens if a newer (or older) version of numpy is required by tensorflow than what we get from conda-forge?

We can always pin versions in some images if we need to and there is no other choice.

I think numpy is so widely used that the conda-forge team puts lots of effort into releasing new versions, so we won't even have to wait long.
But we'll only see this in practice, once we merge and gain some experience.

So, let's try to merge this one 🙂

@mathbunnyru mathbunnyru merged commit b9553a8 into jupyter:main Mar 26, 2024
74 checks passed
@yuvipanda
Contributor

But we'll only see this in practice, when we merge, and gain some experience.

Big big +1! Thank you :)

@ChristofKaufmann
Contributor Author

Thank you for your helpful comments to improve the code @mathbunnyru

Development

Successfully merging this pull request may close these issues.

Add container images for the GPU version of TensorFlow and PyTorch Notebook
6 participants