CUDA Device Binding Runtime Error When Running GPT-3 in Multi-Node Mode Using Slurm #3123
@glara76 We are encouraging users to use the LLM API for inference, which provides a more convenient interface for end users. The documentation will be refined to encourage this as well. @Superjomn @hchings can you help provide concrete suggestions for running multi-node inference with the LLM API? Also cc @laikhtewari since we discussed the documentation refinement plan before as well. Thanks
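For reference, a minimal sketch of the kind of LLM API usage referred to above, assuming a recent TensorRT-LLM release that provides `tensorrt_llm.LLM`; the model path and parallelism settings below are illustrative placeholders rather than values from this issue:

```python
# Minimal LLM API sketch (assumes a recent TensorRT-LLM release with the LLM API).
# The model path and tensor_parallel_size below are placeholders.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="/path/to/model_checkpoint",  # placeholder: HF model or TRT-LLM checkpoint
    tensor_parallel_size=8,             # e.g. TP=8 spanning 2 nodes x 4 GPUs
)

prompts = ["Hello, my name is"]
sampling = SamplingParams(max_tokens=64)

for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```

For multi-node runs the LLM API still needs to be launched under MPI/Slurm; the exact launch command depends on the release, so please check the current LLM API documentation for the recommended multi-node flow.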
Thank you for your reply. Since much of our work last year was based on a single node (GPT-3 on the TensorRT-LLM platform), we are now aiming to extend those implementations to multi-node environments on the same foundation. We followed the guidance starting from this documentation (https://nvidia.github.io/TensorRT-LLM/architecture/core-concepts.html) and assumed that multi-node support within the TensorRT-LLM environment using Slurm was part of the intended use case. We believe NVIDIA may already have a working internal setup for this approach. Additionally, several of our users are working with pre-configured Docker containers in our current environment, so we would prefer to retain this setup if possible rather than shift to a new API or deployment model immediately.
Got it. Can you try to make your multi-node run flow the same as the one here, to help narrow down the potential issue first? @jinyangyuan-nvidia Hi Jinyang, when convenient, can you help take a quick look at the issue reported by the community user? Thanks
The multi-node run flow is shown below; let me know if you need anything else.
**sbatch_multi_run_gpt.sh** (excerpt):
export layer=1
export tag="${layer}layers${tp_size}gpu"
models=( ....
....)
./generate_input_file.py -b ${batch} -s 128
for model in "${models[@]}"; do
    ...
done

**./multi_run_profile.sh** (excerpt):
case $SLURMD_NODENAME in
...
echo "====================== DEBUG INFO ======================"
Thanks
Hi @glara76, can you help verify the value of …? I encountered a different error when trying to run a model on two nodes with 4 GPUs per node. That problem could be solved by changing …
Hi @jinyangyuan-nvidia, thank you for your reply.
After resolving a separate issue, I've encountered the same "invalid device ordinal" error again while running TensorRT-LLM in a Slurm multi-node environment. If a code modification inside TensorRT-LLM is required, I would appreciate it if you could let me know which specific part needs to be changed. Below is how the job is submitted using sbatch, along with the relevant scripts:
$ sbatch sbatch_multi_run_gpt.sh
#!/bin/bash
#SBATCH --account=jychoi
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --partition=debug
#SBATCH --job-name=multinode_GPT3
#SBATCH --time=1:00:00
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --output=gpt_run_%j.log
srun --mpi=pmix --ntasks=${tp_size} --ntasks-per-node=4 --nodes=2 \
--export=ALL,gpt_model=${gpt_model},tp_size=${tp_size} \
./multi_run_profile.sh
- **multi_run_profile.sh**:
#!/bin/bash
source gpt_config.sh
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=0
export NCCL_NET_GDR_LEVEL=2
export CUDA_VISIBLE_DEVICES=0,1,2,3
export CUDA_DEVICE=$SLURM_LOCALID
case $SLURMD_NODENAME in
slurm1-05)
export NCCL_SOCKET_IFNAME=ens21f0
;;
slurm2-03)
export NCCL_SOCKET_IFNAME=eno8403
;;
*)
echo "[ERROR] Unknown node $SLURMD_NODENAME"
exit 1
;;
esac
nsys_result_path="nsys_results/nsys_${gpt_model}"
nsys_filename="report_${gpt_model}_${batch}batch_${SLURM_LOCALID}.nsys-rep"
echo "[Rank $SLURM_PROCID] Running multi_run_profile.sh at: slurm1-05 $(realpath $0)"
#export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}
# We do not need / cannot use CUDA_VISIBLE_DEVICES, because TRT-LLM explicitly selects its device:
# [[nodiscard]] SizeType32 getDevice() const noexcept
# {
# return mDeviceIds[mRank % getGpusPerGroup()];
# }
echo "====================== DEBUG INFO ======================"
#echo "[Rank $SLURM_PROCID] LocalID=$SLURM_LOCALID"
echo "[Rank $SLURM_PROCID] Host=$(hostname)"
#echo "[Rank $SLURM_PROCID] Node=$SLURMD_NODENAME"
#echo "[Rank $SLURM_PROCID] CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
#echo "[Rank $SLURM_PROCID] torch sees $(python3 -c 'import torch; print(torch.cuda.device_count())') GPUs"
echo "========================================================"
python3 ../run.py --engine_dir ${gpt_model}/trt_engines/fp16/${tp_size}-gpu \
--tokenizer_dir gpt \
--input_file input.txt \
--max_output_len 1024 \
--max_input_length 2048 \
--output_csv output.csv \
--run_profiling
Hi @glara76, according to the above logs, it seems that you are using v0.9. This version is not maintained anymore. Can you try the latest main branch and see whether the problem still exists? After building the engines, it is suggested to check …
System Info
2-node configuration (4x A100 GPUs per node)
Who can help?
No response
Information
Tasks
Reproduction
#!/bin/bash
#SBATCH --account=rainbow
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --partition=debug
#SBATCH --job-name=multinode_GPT3
#SBATCH --time=1:00:00
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --output=gpt_run_%j.log
srun --mpi=pmix --ntasks=${tp_size} --ntasks-per-node=4 --nodes=2 \
--export=ALL,gpt_model=${gpt_model},tp_size=${tp_size} \
./multi_run_profile.sh
source gpt_config.sh
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=0
export NCCL_NET_GDR_LEVEL=2
export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID
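# Note: with this setting each rank sees only a single GPU (remapped to index 0),
# which I suspect is the cause of the "invalid device ordinal" error; see the additional notes below.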
case $SLURMD_NODENAME in
slurm1-04)
export NCCL_SOCKET_IFNAME=ens21f0
;;
slurm2-03)
export NCCL_SOCKET_IFNAME=eno8403
;;
*)
echo "[ERROR] Unknown node $SLURMD_NODENAME"
exit 1
;;
esac
echo "====================== DEBUG INFO ======================" (hostname)" (python3 -c 'import torch; print(torch.cuda.device_count())') GPUs"
echo "[Rank $SLURM_PROCID] LocalID=$SLURM_LOCALID"
echo "[Rank
echo "[Rank $SLURM_PROCID] Node=$SLURMD_NODENAME"
echo "[Rank $SLURM_PROCID] CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
echo "[Rank
echo "========================================================"
Expected behavior
Run GPT-3 with TP=8 across 2 nodes (8x A100 GPUs in total)
actual behavior
Traceback (most recent call last):
File "/host/TensorRT-LLM/examples/gpt/../run.py", line 579, in
main(args)
File "/host/TensorRT-LLM/examples/gpt/../run.py", line 428, in main
runner = runner_cls.from_dir(**runner_kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 182, in from_dir
session = GptSession(config=session_config,
RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in cudaSetDevice(device): invalid device ordinal
additional notes
I successfully ran TensorRT-LLM on a single node with tensor parallelism (TP=4) without using Slurm, and everything worked fine. Now, after installing and configuring Slurm for multi-node execution, I'm encountering the following issue:
RuntimeError: CUDA error: invalid device ordinal
This error appears when torch.cuda.set_device() is called.
I believe this is due to the GPU device visibility being limited by Slurm, but I couldn’t find clear guidance on how to configure torch.cuda.set_device() properly under Slurm.
How should device selection be handled in TensorRT-LLM when running in a Slurm-managed multi-node environment?
Is there a recommended way to set the CUDA device per rank while respecting Slurm’s GPU binding?
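To make the question concrete, here is a rough sketch of my understanding of the interaction (assumptions: 4 GPUs per node, SLURM_LOCALID gives the local rank, and device selection of the form rank % gpusPerNode as in the getDevice() snippet in my profiling script; this is illustrative Python, not TensorRT-LLM code):

```python
# Sketch of how per-rank CUDA_VISIBLE_DEVICES interacts with rank-based device selection.
# Assumptions: 4 GPUs per node, SLURM_LOCALID gives the local rank, and the runtime
# effectively calls cudaSetDevice(rank % gpus_per_node). Illustrative only.
import os

gpus_per_node = 4
local_rank = int(os.environ.get("SLURM_LOCALID", "0"))

# Number of devices this process can see after CUDA_VISIBLE_DEVICES filtering.
visible = len(os.environ.get("CUDA_VISIBLE_DEVICES", "0,1,2,3").split(","))

wanted = local_rank % gpus_per_node  # the ordinal that would be passed to cudaSetDevice()

if wanted >= visible:
    # With CUDA_VISIBLE_DEVICES=$SLURM_LOCALID each rank sees exactly one GPU
    # (remapped to index 0), so ranks 1-3 would hit "invalid device ordinal" here.
    print(f"rank {local_rank}: cudaSetDevice({wanted}) would fail, only {visible} device(s) visible")
else:
    print(f"rank {local_rank}: cudaSetDevice({wanted}) is valid")
```

If this reading is correct, each rank needs enough devices visible for its computed ordinal, so I would appreciate confirmation of the recommended way to align Slurm's GPU binding with this selection.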