DeepSpeed Installation Fails During Docker Build (NVML Initialization Issue) #6945
Comments
Hi @asdfry - The errors appear to be from gcc, perhaps the gcc versions are different and causing issues?
Also, some of the warnings clouding the output come from not having py-cpuinfo installed. Could you add that and share the log again?
Hi @asdfry - following up on this, could you share the full Dockerfile that you're using so we can repro?
Hello, thank you for continuing to follow up on this.
FROM nvcr.io/nvidia/pytorch:23.01-py3
SHELL ["/bin/bash", "-c"]
USER root
WORKDIR /root
ENV DEBIAN_FRONTEND=noninteractive
# Set env for torch (compute capability)
ENV TORCH_CUDA_ARCH_LIST=9.0
# Install packages
RUN apt update && \
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash && \
apt install -y git-lfs pdsh openssh-server net-tools tmux tree libaio-dev iputils-ping iproute2 libnvidia-compute-535
# Set for installation
ENV mlnx_image=MLNX_OFED_LINUX-23.10-3.2.2.0-ubuntu20.04-x86_64
ENV hpcx_image=hpcx-v2.18.1-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64
# Install mlnx ofed
RUN wget http://www.mellanox.com/downloads/ofed/MLNX_OFED-23.10-3.2.2.0/$mlnx_image.tgz && \
tar -xvf $mlnx_image.tgz && \
rm $mlnx_image.tgz && \
./$mlnx_image/mlnxofedinstall --user-space-only --without-fw-update -q
# Install hpc-x
RUN wget http://www.mellanox.com/downloads/hpc/hpc-x/v2.18.1/$hpcx_image.tbz && \
tar -xvf $hpcx_image.tbz && \
rm $hpcx_image.tbz
ENV HPCX_HOME=/root/$hpcx_image
# Install python & pip and Install libraries
ENV DS_BUILD_CPU_ADAM=1
COPY requirements.txt requirements.txt
RUN curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && \
python3 get-pip.py && \
pip install --no-cache-dir -r requirements.txt
# Copy files that required for training
COPY source .
COPY configs configs
Hi @asdfry - thanks for the repro case. I'm able to repro it. Where we've used these Docker base images, we started with 23.03; that version seems to work, as do newer ones. It looks like a gcc error, but I'm not sure what the specific issue is. I was able to repro with an even smaller Dockerfile that also removes Mellanox and HPC-X.
Are you able to use a newer version of the PyTorch nvcr container to resolve the problem?
As I mentioned in my first question, I confirmed that updating the base image version resolves the issue with installing DeepSpeed.
$ docker run -it --rm nvcr.io/nvidia/pytorch:23.01-py3 /bin/bash
=============
== PyTorch ==
=============
NVIDIA Release 23.01 (build 52269074)
PyTorch Version 1.14.0a0+44dac51
Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright (c) 2014-2023 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use the NVIDIA Container Toolkit to start this container with GPU support; see
https://docs.nvidia.com/datacenter/cloud-native/ .
NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for PyTorch. NVIDIA recommends the use of the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...
root@84ec7a9ea1c9:/workspace# pip install deepspeed
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting deepspeed
Downloading deepspeed-0.16.3.tar.gz (1.4 MB)
|████████████████████████████████| 1.4 MB 10.5 MB/s
Collecting einops
Downloading einops-0.8.0-py3-none-any.whl (43 kB)
|████████████████████████████████| 43 kB 15.0 MB/s
Collecting hjson
Downloading hjson-3.1.0-py3-none-any.whl (54 kB)
|████████████████████████████████| 54 kB 14.5 MB/s
Requirement already satisfied: msgpack in /usr/local/lib/python3.8/dist-packages (from deepspeed) (1.0.4)
Collecting ninja
Downloading ninja-1.11.1.3-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB)
|████████████████████████████████| 422 kB 11.4 MB/s
Requirement already satisfied: numpy in /usr/local/lib/python3.8/dist-packages (from deepspeed) (1.22.2)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.8/dist-packages (from deepspeed) (22.0)
Requirement already satisfied: psutil in /usr/local/lib/python3.8/dist-packages (from deepspeed) (5.9.4)
Collecting py-cpuinfo
Downloading py_cpuinfo-9.0.0-py3-none-any.whl (22 kB)
Collecting pydantic>=2.0.0
Downloading pydantic-2.10.5-py3-none-any.whl (431 kB)
|████████████████████████████████| 431 kB 11.0 MB/s
Requirement already satisfied: torch in /usr/local/lib/python3.8/dist-packages (from deepspeed) (1.14.0a0+44dac51)
Requirement already satisfied: tqdm in /usr/local/lib/python3.8/dist-packages (from deepspeed) (4.64.1)
Collecting annotated-types>=0.6.0
Downloading annotated_types-0.7.0-py3-none-any.whl (13 kB)
Collecting typing-extensions>=4.12.2
Downloading typing_extensions-4.12.2-py3-none-any.whl (37 kB)
Collecting pydantic-core==2.27.2
Downloading pydantic_core-2.27.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
|████████████████████████████████| 2.0 MB 11.1 MB/s
Requirement already satisfied: sympy in /usr/local/lib/python3.8/dist-packages (from torch->deepspeed) (1.11.1)
Requirement already satisfied: networkx in /usr/local/lib/python3.8/dist-packages (from torch->deepspeed) (2.6.3)
Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.8/dist-packages (from sympy->torch->deepspeed) (1.2.1)
Building wheels for collected packages: deepspeed
Building wheel for deepspeed (setup.py) ... done
Created wheel for deepspeed: filename=deepspeed-0.16.3-py3-none-any.whl size=1549958 sha256=057c552cf5b248514aa2293002f22da03d8d6e651a73141ac8ebf19d2c59c77e
Stored in directory: /tmp/pip-ephem-wheel-cache-au9njj9c/wheels/72/85/51/65020b7f481c0b9e013a823b05be2d297ab81a1627d4cb8666
Successfully built deepspeed
Installing collected packages: typing-extensions, pydantic-core, annotated-types, pydantic, py-cpuinfo, ninja, hjson, einops, deepspeed
Attempting uninstall: typing-extensions
Found existing installation: typing-extensions 4.4.0
Uninstalling typing-extensions-4.4.0:
Successfully uninstalled typing-extensions-4.4.0
Attempting uninstall: pydantic
Found existing installation: pydantic 1.10.4
Uninstalling pydantic-1.10.4:
Successfully uninstalled pydantic-1.10.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
thinc 8.1.7 requires pydantic!=1.8,!=1.8.1,<1.11.0,>=1.7.4, but you have pydantic 2.10.5 which is incompatible.
spacy 3.5.0 requires pydantic!=1.8,!=1.8.1,<1.11.0,>=1.7.4, but you have pydantic 2.10.5 which is incompatible.
confection 0.0.4 requires pydantic!=1.8,!=1.8.1,<1.11.0,>=1.7.4, but you have pydantic 2.10.5 which is incompatible.
Successfully installed annotated-types-0.7.0 deepspeed-0.16.3 einops-0.8.0 hjson-3.1.0 ninja-1.11.1.3 py-cpuinfo-9.0.0 pydantic-2.10.5 pydantic-core-2.27.2 typing-extensions-4.12.2
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: You are using pip version 21.2.4; however, version 24.3.1 is available.
You should consider upgrading via the '/usr/bin/python -m pip install --upgrade pip' command.
root@84ec7a9ea1c9:/workspace# ds_report
[2025-01-23 05:17:37,999] [WARNING] [real_accelerator.py:181:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2025-01-23 05:17:38,008] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cpu (auto detect)
df: /root/.triton/autotune: No such file or directory
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
deepspeed_not_implemented [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
deepspeed_ccl_comm ..... [NO] ....... [OKAY]
deepspeed_shm_comm ..... [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.14.0a0+44dac51
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.16.3, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.14
shared memory (/dev/shm) size .... 64.00 MB
[WARNING] /dev/shm size might be too small, if running in docker increase to at least --shm-size='1gb'
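For readers who hit the same shared-memory warnings, here is a minimal sketch that ties the ds_report figure (64 MB) to the flags recommended in the container banner above. The helper names `shm_large_enough` and `recommended_run_cmd` are hypothetical, and the command is printed rather than executed, so this is an illustration and not something taken from DeepSpeed or NVIDIA tooling.

```shell
# Hypothetical helper: succeeds when a /dev/shm size in whole megabytes
# meets the 1 GB (1024 MB) minimum that ds_report suggests.
shm_large_enough() {
    [ "$1" -ge 1024 ]
}

# Hypothetical helper: builds the docker run command line with the flags
# from the NVIDIA banner. It only prints the command, it does not run it.
recommended_run_cmd() {
    echo "docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -it --rm $1"
}

# 64 is the /dev/shm size in MB that ds_report printed in the log above.
if ! shm_large_enough 64; then
    recommended_run_cmd "nvcr.io/nvidia/pytorch:23.01-py3"
fi
```

With `--ipc=host` the container uses the host's shared memory; passing `--shm-size=1gb` instead, as ds_report suggests, would be the alternative if sharing the host IPC namespace is not wanted.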
Hi @asdfry - thanks, I saw you acknowledged that updating the version worked, but I didn't know of the hard requirement on the NVIDIA driver version. I believe the install works inside the running container because you are not pre-compiling the cpu_adam op there, whereas the Dockerfile sets DS_BUILD_CPU_ADAM=1 to build it at install time. If I run:
I am able to repro the failure in the docker image. I'll take a look at what else could be wrong.
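The exact command used for this repro was not preserved in the thread. As a rough illustration of the difference being described, the sketch below prints the two install variants: a plain install, where ops are JIT-compiled at runtime (as the ds_report note says), and an install with `DS_BUILD_CPU_ADAM=1` set, which matches the `ENV` line in the Dockerfile. The helper `ds_install_cmd` is hypothetical, and the commands are only printed, not executed.

```shell
# Hypothetical helper: prints the install command for a given mode so the
# sketch can be run anywhere without touching the Python environment.
ds_install_cmd() {
    case "$1" in
        # Plain install: ops such as cpu_adam are JIT-compiled when first used.
        jit)      echo "pip install deepspeed" ;;
        # Pre-compile cpu_adam at install time, as the Dockerfile's ENV line does.
        prebuild) echo "DS_BUILD_CPU_ADAM=1 pip install deepspeed" ;;
        *)        echo "usage: ds_install_cmd jit|prebuild" >&2; return 1 ;;
    esac
}

ds_install_cmd jit
ds_install_cmd prebuild
```

Running the second variant inside the 23.01 container is presumably what triggers the compile step that the plain `pip install deepspeed` session above never reached.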
@asdfry - I believe the issue is that CPU_ADAM needs to be compiled with I was able to use this docker image, update torch to 2.2 (for example) and build successfully. I believe that should unblock you unless you are also bound by the torch version?
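One way to act on this in a build script is to pre-compile cpu_adam only when the installed torch is recent enough, and fall back to JIT otherwise. The sketch below assumes 2.2 as the threshold purely because that is the version reported to work here; the real minimum is not established in this thread. The helpers `version_ge` and `ds_build_cpu_adam_flag` are hypothetical, and `sort -V` assumes GNU coreutils.

```shell
# Hypothetical helper: succeeds when version $1 is >= version $2.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Hypothetical helper: prints 1 when cpu_adam should be pre-compiled for the
# given torch version string, 0 when it should be left to JIT compilation.
ds_build_cpu_adam_flag() {
    # Drop pre-release and local suffixes, e.g. "1.14.0a0+44dac51" -> "1.14.0".
    release="${1%%[a-z+]*}"
    # 2.2 is an assumed threshold taken from the comment above, not a
    # documented DeepSpeed requirement.
    if version_ge "$release" "2.2"; then
        echo 1
    else
        echo 0
    fi
}

# Torch version shipped in the 23.01 container, taken from the log above.
ds_build_cpu_adam_flag "1.14.0a0+44dac51"
```

In a real build the version string could come from `python -c 'import torch; print(torch.__version__)'`, and the result could be exported as `DS_BUILD_CPU_ADAM` before running `pip install deepspeed`.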
Hello, thank you so much for taking the time to provide such detailed answers to my question. Wishing you a wonderful day!
Hello,
I encountered an issue while building a Docker image for deep learning model training, specifically when attempting to install DeepSpeed.
Issue
When building the Docker image, the DeepSpeed installation fails with a warning that NVML initialization is not possible.
However, if I create a container from the same image and install DeepSpeed inside the container, the installation works without any issues.
Environment
Base Image: nvcr.io/nvidia/pytorch:23.01-py3
DeepSpeed Version: 0.16.2
Build Log
docker_build.log
Additional Context
The problem does not occur with the newer base image nvcr.io/nvidia/pytorch:24.05-py3.
Thank you.