Update readme
Verdi March committed Apr 19, 2024
1 parent b20f440 commit 07a509c
Showing 3 changed files with 143 additions and 7 deletions.
@@ -1,8 +1,145 @@
# Runtime sanity checks: which NCCL is loaded by PyTorch

The scripts in this folder pinpoint the exact NCCL library that a PyTorch application actually
loads (i.e., `dlopen`s), even when multiple NCCL libraries are installed.

## 1. Motivation

Knowing the precise NCCL version is important for performance tuning. The exact NCCL library used
by PyTorch depends on how PyTorch was installed. You can install multiple versions of PyTorch in
your AMI or Docker image, and each installation may use a different NCCL library.

Consider this use case for illustration. Suppose your EC2 instance runs the DLAMI `Deep Learning
Base OSS Nvidia Driver GPU AMI (Ubuntu 20.04) 20240314`. This AMI provides `libnccl.so` version
2.18.5 under `/usr/local/cuda` ([Section 3.1](#31-system-wide-nccl-provided-by-dlami)). You then
install PyTorch into a conda environment or a regular Python virtual environment, using the
[prebuilt PyTorch installer](https://pytorch.org/get-started/locally/) from the pytorch
[channel](https://anaconda.org/pytorch/repo) or [PyPI](https://pypi.org/project/torch/),
respectively.

It would be misleading to assume that PyTorch always uses the NCCL 2.18.5 provided by the DLAMI.
The prebuilt PyTorch installs and uses its own NCCL library, whose version can differ from what the
AMI provides system-wide, as shown below. In fact, in addition to NCCL, the prebuilt PyTorch also
brings its own CUDA runtime, even though the DLAMI already provides multiple CUDA versions
system-wide under `/usr/local/cuda*`.

```console
$ conda list | egrep 'torch|nvidia'
nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi
nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi
nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi
nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi
nvidia-cudnn-cu12 8.9.2.26 pypi_0 pypi
nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi
nvidia-curand-cu12 10.3.2.106 pypi_0 pypi
nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi
nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi
nvidia-nccl-cu12 2.19.3 pypi_0 pypi
nvidia-nvjitlink-cu12 12.4.99 pypi_0 pypi
nvidia-nvtx-cu12 12.1.105 pypi_0 pypi
torch 2.2.1 pypi_0 pypi
torchaudio 2.2.1 pypi_0 pypi
torchvision 0.17.1 pypi_0 pypi
```

With multiple `libnccl.so` libraries installed, how do we ascertain which one PyTorch actually
uses? Is it the system-wide NCCL (version 2.18.5 in our example DLAMI), or the one installed
alongside PyTorch?

As it turns out, we can pinpoint exactly which NCCL library PyTorch will load. In the snippets
below, we can see that this particular PyTorch loads NCCL version `(2, 19, 3)` from the conda
environment.

```console
# Library search paths, without conda environment
$ env | grep ^LD
LD_LIBRARY_PATH=/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:/usr/local/cuda-12.1/lib:/usr/local/cuda-12.1/lib64:/usr/local/cuda-12.1:/usr/local/cuda-12.1/targets/x86_64-linux/lib/:/usr/local/cuda-12.1/extras/CUPTI/lib64:/usr/local/lib:/usr/lib

# Print version of the system-wide NCCL, pre-installed with DLAMI
$ strings /usr/local/cuda/lib/libnccl.so | grep -m1 -i '^NCCL version .*\+cuda.*$'
NCCL version 2.18.5+cuda12.2

# Activate the conda environment where PyTorch is located.
$ source miniconda3/bin/activate ./pt_fsdp_haha/

# Conda environment does not change the library search paths.
(conda_env_name) $ env | grep ^LD
LD_LIBRARY_PATH=/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:/usr/local/cuda-12.1/lib:/usr/local/cuda-12.1/lib64:/usr/local/cuda-12.1:/usr/local/cuda-12.1/targets/x86_64-linux/lib/:/usr/local/cuda-12.1/extras/CUPTI/lib64:/usr/local/lib:/usr/lib

# Query the NCCL version from PyTorch
(conda_env_name) $ python -c 'import torch; print(f"{torch.cuda.nccl.version()=}")'
torch.cuda.nccl.version()=(2, 19, 3)

# Similar to above, but also pinpoint the exact `libnccl.so` file.
(conda_env_name) $ LD_DEBUG=libs python -c 'import torch; print(f"{torch.cuda.nccl.version()=}")' 2>&1 | egrep 'torch.cuda.nccl.version|libnccl.so'
47352: find library=libnccl.so.2 [0]; searching
47352: trying file=/fsx/marcverd/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp_haha/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libnccl.so.2
47352: trying file=/fsx/marcverd/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp_haha/lib/python3.10/site-packages/torch/lib/../../nvidia/cuda_cupti/lib/libnccl.so.2
47352: trying file=/fsx/marcverd/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp_haha/lib/python3.10/site-packages/torch/lib/../../nvidia/cuda_nvrtc/lib/libnccl.so.2
47352: trying file=/fsx/marcverd/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp_haha/lib/python3.10/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libnccl.so.2
47352: trying file=/fsx/marcverd/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp_haha/lib/python3.10/site-packages/torch/lib/../../nvidia/cudnn/lib/libnccl.so.2
47352: trying file=/fsx/marcverd/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp_haha/lib/python3.10/site-packages/torch/lib/../../nvidia/cufft/lib/libnccl.so.2
47352: trying file=/fsx/marcverd/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp_haha/lib/python3.10/site-packages/torch/lib/../../nvidia/curand/lib/libnccl.so.2
47352: trying file=/fsx/marcverd/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp_haha/lib/python3.10/site-packages/torch/lib/../../nvidia/cusolver/lib/libnccl.so.2
47352: trying file=/fsx/marcverd/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp_haha/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libnccl.so.2
47352: trying file=/fsx/marcverd/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp_haha/lib/python3.10/site-packages/torch/lib/../../nvidia/nccl/lib/libnccl.so.2
47352: calling init: /fsx/marcverd/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp_haha/lib/python3.10/site-packages/torch/lib/../../nvidia/nccl/lib/libnccl.so.2
torch.cuda.nccl.version()=(2, 19, 3)
47352: calling fini: /fsx/marcverd/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp_haha/lib/python3.10/site-packages/torch/lib/../../nvidia/nccl/lib/libnccl.so.2 [0]
```
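
As a complement to the `LD_DEBUG` trace, the same question can be answered from inside the Python
process itself by reading `/proc/self/maps`. The snippet below is a minimal, illustrative sketch
(assuming a Linux host; it is not one of the scripts in this folder):

```python
# Illustrative sketch: print the exact libnccl file mapped into this process
# after PyTorch has pulled NCCL in.
import torch

# As the LD_DEBUG trace above shows, this call causes libnccl.so.2 to be loaded.
print(f"{torch.cuda.nccl.version()=}")

# /proc/self/maps lists every file mapped into the current process; the libnccl
# entry (if any) is the exact file PyTorch loaded. If NCCL were statically
# linked into libtorch, no separate libnccl entry would appear.
with open("/proc/self/maps") as maps:
    nccl_paths = {line.split()[-1] for line in maps if "libnccl" in line}

print("\n".join(sorted(nccl_paths)) or "no standalone libnccl mapped")
```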

By now, hopefully you're convinced of the need for a runtime probe to pinpoint the version of NCCL
loaded by PyTorch.

## 2. Howto

Pre-requisites:

- one GPU.
- the `python` command must invoke the binary from your PyTorch environment. Typically, you must
  first activate your environment (conda, virtual env, etc.); a quick check is sketched below.
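
For example (an illustrative snippet, not one of the scripts in this folder), run the following
with the same `python` that the probe will use:

```python
# Confirm that `python` resolves to the intended PyTorch environment and that a GPU is visible.
import sys

import torch

print(sys.executable)       # should point inside your conda env / virtual env
print(torch.__version__)    # the PyTorch build that the probe will exercise
print(torch.cuda.is_available(), torch.cuda.device_count())  # expect True and at least 1 GPU
```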

Direct invocation: `./probe-pt-nccl-aws-libs.sh`

Via Slurm: `srun -l -N1 ./probe-pt-nccl-aws-libs.sh`

## 3. Appendix

### 3.1. System-wide NCCL provided by DLAMI

See the [DLAMI release
notes](https://aws.amazon.com/releasenotes/aws-deep-learning-base-gpu-ami-ubuntu-20-04/) for the
versions of pre-installed [CUDA runtime](https://docs.nvidia.com/cuda/),
[NCCL](https://github.com/NVIDIA/nccl), and [aws-ofi-nccl](https://github.com/aws/aws-ofi-nccl).

Optionally, follow the steps below to validate the versions on a live EC2 instance.

```console
# DLAMI provides multiple versions of CUDA runtime
$ ls -ald /usr/local/cuda-*/ /usr/local/cuda
lrwxrwxrwx 1 root root 21 Mar 14 22:22 /usr/local/cuda -> /usr/local/cuda-12.1/
drwxr-xr-x 17 root root 4096 Mar 14 22:01 /usr/local/cuda-11.7/
drwxr-xr-x 19 root root 4096 Mar 14 22:19 /usr/local/cuda-11.8/
drwxr-xr-x 19 root root 4096 Mar 14 22:13 /usr/local/cuda-12.0/
drwxr-xr-x 19 root root 4096 Mar 14 22:25 /usr/local/cuda-12.1/
drwxr-xr-x 19 root root 4096 Mar 14 22:07 /usr/local/cuda-12.2/

# Each CUDA version brings its own NCCL library
$ find /usr/local/cuda-* -name 'libnccl.so'
/usr/local/cuda-11.7/lib/libnccl.so
/usr/local/cuda-11.8/lib/libnccl.so
/usr/local/cuda-12.0/lib/libnccl.so
/usr/local/cuda-12.1/lib/libnccl.so
/usr/local/cuda-12.2/lib/libnccl.so

# Print NCCL versions
$ find /usr/local/cuda-* -name 'libnccl.so' | xargs -n1 -I{} bash -c "echo -n {} '=> ' ; strings {} | grep '^NCCL version' -m1"
/usr/local/cuda-11.7/lib/libnccl.so => NCCL version 2.16.2+cuda11.8
/usr/local/cuda-11.8/lib/libnccl.so => NCCL version 2.16.2+cuda11.8
/usr/local/cuda-12.0/lib/libnccl.so => NCCL version 2.18.5+cuda12.2
/usr/local/cuda-12.1/lib/libnccl.so => NCCL version 2.18.5+cuda12.2
/usr/local/cuda-12.2/lib/libnccl.so => NCCL version 2.18.5+cuda12.2

# Print aws-ofi-nccl version
$ strings /opt/aws-ofi-nccl/lib/libnccl-net.so | grep '^NET/OFI Initializing aws-ofi-nccl .*-aws'
NET/OFI Initializing aws-ofi-nccl 1.7.4-aws
```
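
As an optional alternative to `strings`, a specific `libnccl.so` can also be asked for its version
through the `ncclGetVersion()` C API. Below is an illustrative `ctypes` sketch; the path is the
DLAMI default used above, and the version decoding assumes NCCL 2.9 or newer:

```python
# Query a given libnccl.so for its version via ncclGetVersion().
import ctypes

lib = ctypes.CDLL("/usr/local/cuda/lib/libnccl.so")  # adjust the path for your system
version = ctypes.c_int(0)
ret = lib.ncclGetVersion(ctypes.byref(version))
assert ret == 0, f"ncclGetVersion returned ncclResult_t={ret}"

# NCCL >= 2.9 encodes the version as major*10000 + minor*100 + patch (e.g. 21805 -> 2.18.5).
major, rest = divmod(version.value, 10000)
minor, patch = divmod(rest, 100)
print(f"NCCL {major}.{minor}.{patch}")
```
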
@@ -1,7 +1,7 @@
#!/usr/bin/env python

"""
All-reduce on one GPU, to probe libraries opened. This script can be run without torchrun.
Usage:
@@ -3,9 +3,8 @@
set -uo pipefail
## NOTE: you must activate the python environment, or make sure the right python is in PATH

# Call PyTorch collective on a single GPU, to force load libnccl.
strace_output=$( MASTER_ADDR=localhost MASTER_PORT=0 RANK=0 strace -e trace=open,openat -e status=successful python ./all_reduce_single_gpu.py 2>&1 )
retval="$?"
[[ "retval" -eq 0 ]] || { echo "$strace_output" ; exit "$retval" ; }

