PyTorch-based utility to recommend env vars. #252
Comments
Does it need to be a PyTorch script, or can it just read from efa-versions.sh?
It needs to be a PyTorch script. It serves a different purpose than efa-versions.sh (which probes what's available on disk). Rather, it should probe through the libraries that the PyTorch script will actually load (via dlopen or whatever) -- which can be, e.g., the nccl in the pip/conda environment (not the DLAMI version), or in a container, or even the one picked up when the user overrides with LD_* env vars.
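As an illustration, here is a minimal sketch of that in-process probe. It assumes Linux (so `/proc/self/maps` is readable) and a CUDA-enabled torch build; the point is that it reports whichever `libnccl.so.2` the loader actually mapped into this process, wherever it came from.

```python
# Sketch only: report the NCCL that *this* process actually uses, not what is
# installed system-wide. Assumes Linux (/proc/self/maps) and a CUDA build of torch.
import torch

print("torch :", torch.__version__)
print("nccl  :", torch.cuda.nccl.version())  # NCCL version torch reports

# By this point the loader has mapped libnccl.so.2 (see the LD_DEBUG trace below);
# /proc/self/maps shows which file on disk was actually picked up, whether it came
# from the pip/conda env, a container, or an LD_* override.
with open("/proc/self/maps") as maps:
    loaded = sorted({line.split()[-1] for line in maps if "libnccl" in line})
for path in loaded:
    print("loaded:", path)
```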
Here's the difference between what PyTorch actually uses and what the AMI pre-installed system-wide. My DLAMI provides cuda-12.1 (default) with nccl-2.18.5. Simply probing what's installed system-wide is useful (or should I say limited?) only for the pre-built nccl-tests, or any application compiled against that nccl. Prebuilt PyTorch, on the other hand, ships with its own nccl. We can see this from the versions printed below:

```
$ env | grep ^LD
LD_LIBRARY_PATH=/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:/usr/local/cuda-12.1/lib:/usr/local/cuda-12.1/lib64:/usr/local/cuda-12.1:/usr/local/cuda-12.1/targets/x86_64-linux/lib/:/usr/local/cuda-12.1/extras/CUPTI/lib64:/usr/local/lib:/usr/lib
$ strings /usr/local/cuda/lib/libnccl.so | grep -m1 -i '^NCCL version .*\+cuda.*$'
NCCL version 2.18.5+cuda12.2
$ source miniconda3/bin/activate ./pt_fsdp_haha/
(conda_env_name) $ env | grep ^LD
LD_LIBRARY_PATH=/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:/usr/local/cuda-12.1/lib:/usr/local/cuda-12.1/lib64:/usr/local/cuda-12.1:/usr/local/cuda-12.1/targets/x86_64-linux/lib/:/usr/local/cuda-12.1/extras/CUPTI/lib64:/usr/local/lib:/usr/lib
(conda_env_name) $ python -c 'import torch; print(f"{torch.cuda.nccl.version()=}")'
torch.cuda.nccl.version()=(2, 19, 3)
(conda_env_name) $ LD_DEBUG=libs python -c 'import torch; print(f"{torch.cuda.nccl.version()=}")' 2>&1 | egrep 'torch.cuda.nccl.version|libnccl.so'
47352: find library=libnccl.so.2 [0]; searching
47352: trying file=/fsx/marcverd/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp_haha/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libnccl.so.2
47352: trying file=/fsx/marcverd/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp_haha/lib/python3.10/site-packages/torch/lib/../../nvidia/cuda_cupti/lib/libnccl.so.2
47352: trying file=/fsx/marcverd/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp_haha/lib/python3.10/site-packages/torch/lib/../../nvidia/cuda_nvrtc/lib/libnccl.so.2
47352: trying file=/fsx/marcverd/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp_haha/lib/python3.10/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libnccl.so.2
47352: trying file=/fsx/marcverd/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp_haha/lib/python3.10/site-packages/torch/lib/../../nvidia/cudnn/lib/libnccl.so.2
47352: trying file=/fsx/marcverd/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp_haha/lib/python3.10/site-packages/torch/lib/../../nvidia/cufft/lib/libnccl.so.2
47352: trying file=/fsx/marcverd/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp_haha/lib/python3.10/site-packages/torch/lib/../../nvidia/curand/lib/libnccl.so.2
47352: trying file=/fsx/marcverd/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp_haha/lib/python3.10/site-packages/torch/lib/../../nvidia/cusolver/lib/libnccl.so.2
47352: trying file=/fsx/marcverd/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp_haha/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libnccl.so.2
47352: trying file=/fsx/marcverd/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp_haha/lib/python3.10/site-packages/torch/lib/../../nvidia/nccl/lib/libnccl.so.2
47352: calling init: /fsx/marcverd/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp_haha/lib/python3.10/site-packages/torch/lib/../../nvidia/nccl/lib/libnccl.so.2
torch.cuda.nccl.version()=(2, 19, 3)
47352:	 calling fini: /fsx/marcverd/awsome-distributed-training/3.test_cases/10.FSDP/pt_fsdp_haha/lib/python3.10/site-packages/torch/lib/../../nvidia/nccl/lib/libnccl.so.2 [0]
```
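The same version check can be done without `LD_DEBUG` by scanning a given `libnccl.so` for its embedded version string, i.e. a Python equivalent of the `strings | grep` probe above. This is only a sketch: the `NCCL version X+cudaY` string appears in the builds shown here, but its presence in any particular libnccl build is an assumption.

```python
# Sketch: Python equivalent of `strings libnccl.so | grep 'NCCL version'`.
# The version-string format is taken from the output above; whether it exists
# in a given libnccl build is an assumption.
import re
import sys

def nccl_version_string(path: str) -> str | None:
    with open(path, "rb") as f:
        m = re.search(rb"NCCL version [0-9.]+\+cuda[0-9.]+", f.read())
    return m.group(0).decode() if m else None

if __name__ == "__main__":
    # e.g. the DLAMI copy, or the path found via /proc/self/maps above
    print(nccl_version_string(sys.argv[1]))
```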
This is also related to #284.
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Need a PyTorch script that probes the instance and software stack, then prints the necessary EFA env vars. Use this script to avoid spending hours debugging a runtime crash caused by "forgetting to set env var XXX when using a combination of `libXXX-versionA` and `libYYY-versionB`".
It can be a plain Python script, as long as it can accurately probe the versions of nccl, aws-ofi-nccl (nccl-net), libfabric, and EFA (via dlopen or whatever works). The probing mechanism must be robust to whether the script runs inside a container or on the host, whether env vars are overridden by the job script, etc.
At the very least, the script must cover the env vars in the EFA cheatsheet, plus new ones (like the CUDA memory-sync env var needed to work around an incompatibility between specific versions of nccl and the EFA installer).
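A rough sketch of what such a script could look like is below. It is not a definitive implementation: the sonames (`libfabric.so.1`, `libnccl-net.so`), the use of libfabric's `fi_version()` entry point, and the recommendation table at the end are assumptions/placeholders to be replaced by the actual EFA cheatsheet rules.

```python
#!/usr/bin/env python3
"""Sketch of the requested probe-and-recommend script (not a definitive
implementation): sonames, the fi_version() call, and the recommendation
table are assumptions/placeholders, not the actual cheatsheet rules."""
import ctypes

def loaded_paths(substr: str) -> list[str]:
    """Shared objects already mapped into this process (Linux only)."""
    with open("/proc/self/maps") as maps:
        return sorted({line.split()[-1] for line in maps if substr in line})

def probe_nccl():
    import torch  # probe through the same stack the training job will use
    return torch.cuda.nccl.version(), loaded_paths("libnccl.so")

def probe_libfabric():
    try:
        lib = ctypes.CDLL("libfabric.so.1")  # assumed soname; respects LD_LIBRARY_PATH
    except OSError:
        return None
    lib.fi_version.restype = ctypes.c_uint32
    version = lib.fi_version()
    return (version >> 16, version & 0xFFFF)  # (FI_MAJOR, FI_MINOR)

def probe_ofi_nccl():
    try:
        ctypes.CDLL("libnccl-net.so")  # aws-ofi-nccl plugin, assumed soname
    except OSError:
        return None
    return loaded_paths("libnccl-net")

def recommend(nccl, fabric):
    env = {}
    # Placeholder rules: the real script would encode the EFA cheatsheet here,
    # keyed on the probed (nccl, aws-ofi-nccl, libfabric, efa-installer) versions.
    if fabric is not None:
        env["FI_PROVIDER"] = "efa"
    return env

if __name__ == "__main__":
    nccl = probe_nccl()
    fabric = probe_libfabric()
    print("nccl      :", nccl)
    print("libfabric :", fabric)
    print("nccl-net  :", probe_ofi_nccl())
    print("suggested env vars (placeholders):")
    for key, value in recommend(nccl, fabric).items():
        print(f"  export {key}={value}")
```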