CUDA Device Binding Runtime Error When Running GPT-3 in Multi-Node Mode Using Slurm #3123
@glara76 We are encouraging users to use the LLM API for inference, which provides a more convenient interface for end users. The documentation will be refined to encourage this as well. @Superjomn @hchings can you help provide concrete suggestions for running multi-node inference with the LLM API? Also cc @laikhtewari since we discussed the documentation refinement plan before as well. Thanks
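For reference, a minimal sketch of the kind of LLM API usage referred to above, assuming a recent TensorRT-LLM release that provides `tensorrt_llm.LLM`; the model path and parallelism settings below are illustrative placeholders rather than values from this issue:

```python
# Minimal LLM API sketch (assumes a recent TensorRT-LLM release with the LLM API).
# The model path and tensor_parallel_size below are placeholders.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="/path/to/model_checkpoint",  # placeholder: HF model or TRT-LLM checkpoint
    tensor_parallel_size=8,             # e.g. TP=8 spanning 2 nodes x 4 GPUs
)

prompts = ["Hello, my name is"]
sampling = SamplingParams(max_tokens=64)

for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```

For multi-node runs the LLM API still needs to be launched under MPI/Slurm; the exact launch command depends on the release, so please check the current LLM API documentation for the recommended multi-node flow.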
Thank you for your reply. Since much of our work last year was based on a single node (GPT-3 on the TensorRT-LLM platform), we are now aiming to extend those implementations to multi-node environments on the same foundation. We followed the guidance starting from this documentation (https://nvidia.github.io/TensorRT-LLM/architecture/core-concepts.html) and assumed that multi-node support within the TensorRT-LLM environment using Slurm was part of the intended use case. We believe NVIDIA may already have a working internal setup for this approach. Additionally, several of our users are working with pre-configured Docker containers in our current environment, so we would prefer to retain this setup if possible rather than shift to a new API or deployment model immediately.
Got it. Can you try to make your multi-node run flow the same as the one here, to help narrow down the potential issue first? @jinyangyuan-nvidia Hi Jinyang, when convenient, can you help take a quick look at the issue reported by the community user? Thanks
The multi-node run flow is shown below; let me know if you need anything else.
**sbatch_multi_run_gpt.sh** (excerpt):
export layer=1
export tag="${layer}layers${tp_size}gpu"
models=( ....
....)
./generate_input_file.py -b ${batch} -s 128
for model in "${models[@]}"; do
    ...
done

**./multi_run_profile.sh** (excerpt):
case $SLURMD_NODENAME in
...
echo "====================== DEBUG INFO ======================"
Thanks
Hi @glara76, can you help verify the value of …? I encountered a different error when trying to run a model on two nodes with 4 GPUs per node. That problem could be solved by changing …
Hi @jinyangyuan-nvidia, thank you for your reply.
After resolving a separate issue, I've encountered the same "invalid device ordinal" error again while running TensorRT-LLM in a Slurm multi-node environment. If a code modification inside TensorRT-LLM is required, I would appreciate it if you could let me know which specific part needs to be changed. Below is how the job is submitted using sbatch, along with the relevant scripts:
$ sbatch sbatch_multi_run_gpt.sh
#!/bin/bash
#SBATCH --account=jychoi
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --partition=debug
#SBATCH --job-name=multinode_GPT3
#SBATCH --time=1:00:00
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --output=gpt_run_%j.log
srun --mpi=pmix --ntasks=${tp_size} --ntasks-per-node=4 --nodes=2 \
--export=ALL,gpt_model=${gpt_model},tp_size=${tp_size} \
./multi_run_profile.sh
- **multi_run_profile.sh**:
#!/bin/bash
source gpt_config.sh
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=0
export NCCL_NET_GDR_LEVEL=2
export CUDA_VISIBLE_DEVICES=0,1,2,3
export CUDA_DEVICE=$SLURM_LOCALID
case $SLURMD_NODENAME in
slurm1-05)
export NCCL_SOCKET_IFNAME=ens21f0
;;
slurm2-03)
export NCCL_SOCKET_IFNAME=eno8403
;;
*)
echo "[ERROR] Unknown node $SLURMD_NODENAME"
exit 1
;;
esac
nsys_result_path="nsys_results/nsys_${gpt_model}"
nsys_filename="report_${gpt_model}_${batch}batch_${SLURM_LOCALID}.nsys-rep"
echo "[Rank $SLURM_PROCID] Running multi_run_profile.sh at: slurm1-05 $(realpath $0)"
#export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}
# We do not need / cannot use CUDA_VISIBLE_DEVICES, because TRT-LLM explicitly selects its device:
# [[nodiscard]] SizeType32 getDevice() const noexcept
# {
# return mDeviceIds[mRank % getGpusPerGroup()];
# }
echo "====================== DEBUG INFO ======================"
#echo "[Rank $SLURM_PROCID] LocalID=$SLURM_LOCALID"
echo "[Rank $SLURM_PROCID] Host=$(hostname)"
#echo "[Rank $SLURM_PROCID] Node=$SLURMD_NODENAME"
#echo "[Rank $SLURM_PROCID] CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
#echo "[Rank $SLURM_PROCID] torch sees $(python3 -c 'import torch; print(torch.cuda.device_count())') GPUs"
echo "========================================================"
python3 ../run.py --engine_dir ${gpt_model}/trt_engines/fp16/${tp_size}-gpu \
--tokenizer_dir gpt \
--input_file input.txt \
--max_output_len 1024 \
--max_input_length 2048 \
--output_csv output.csv \
--run_profiling
Hi @glara76, according to the above logs, it seems that you are using v0.9. This version is not maintained anymore. Can you try the latest main branch and see whether the problem still exists? After building the engines, it is suggested to check …
System Info
2-node configuration (4x A100 GPUs per node)
Who can help?
No response
Information
Tasks
Reproduction
#!/bin/bash
#SBATCH --account=rainbow
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --partition=debug
#SBATCH --job-name=multinode_GPT3
#SBATCH --time=1:00:00
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --output=gpt_run_%j.log
srun --mpi=pmix --ntasks=${tp_size} --ntasks-per-node=4 --nodes=2 \
--export=ALL,gpt_model=${gpt_model},tp_size=${tp_size} \
./multi_run_profile.sh
source gpt_config.sh
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=0
export NCCL_NET_GDR_LEVEL=2
export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID
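# Note: with this setting each rank sees only a single GPU (remapped to index 0),
# which I suspect is the cause of the "invalid device ordinal" error; see the additional notes below.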
case $SLURMD_NODENAME in
slurm1-04)
export NCCL_SOCKET_IFNAME=ens21f0
;;
slurm2-03)
export NCCL_SOCKET_IFNAME=eno8403
;;
*)
echo "[ERROR] Unknown node $SLURMD_NODENAME"
exit 1
;;
esac
echo "====================== DEBUG INFO ======================" (hostname)" (python3 -c 'import torch; print(torch.cuda.device_count())') GPUs"
echo "[Rank $SLURM_PROCID] LocalID=$SLURM_LOCALID"
echo "[Rank
echo "[Rank $SLURM_PROCID] Node=$SLURMD_NODENAME"
echo "[Rank $SLURM_PROCID] CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
echo "[Rank
echo "========================================================"
Expected behavior
Run GPT-3 with TP=8 across 2 nodes (8x A100 GPUs in total)
actual behavior
Traceback (most recent call last):
File "/host/TensorRT-LLM/examples/gpt/../run.py", line 579, in
main(args)
File "/host/TensorRT-LLM/examples/gpt/../run.py", line 428, in main
runner = runner_cls.from_dir(**runner_kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 182, in from_dir
session = GptSession(config=session_config,
RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in cudaSetDevice(device): invalid device ordinal
additional notes
I successfully ran TensorRT-LLM on a single node with tensor parallelism (TP=4) without using Slurm, and everything worked fine. Now, after installing and configuring Slurm for multi-node execution, I'm encountering the following issue:
RuntimeError: CUDA error: invalid device ordinal
This error appears when torch.cuda.set_device() is called.
I believe this is due to the GPU device visibility being limited by Slurm, but I couldn’t find clear guidance on how to configure torch.cuda.set_device() properly under Slurm.
How should device selection be handled in TensorRT-LLM when running in a Slurm-managed multi-node environment?
Is there a recommended way to set the CUDA device per rank while respecting Slurm’s GPU binding?
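To make the question concrete, here is a rough sketch of my understanding of the interaction (assumptions: 4 GPUs per node, SLURM_LOCALID gives the local rank, and device selection of the form rank % gpusPerNode as in the getDevice() snippet in my profiling script; this is illustrative Python, not TensorRT-LLM code):

```python
# Sketch of how per-rank CUDA_VISIBLE_DEVICES interacts with rank-based device selection.
# Assumptions: 4 GPUs per node, SLURM_LOCALID gives the local rank, and the runtime
# effectively calls cudaSetDevice(rank % gpus_per_node). Illustrative only.
import os

gpus_per_node = 4
local_rank = int(os.environ.get("SLURM_LOCALID", "0"))

# Number of devices this process can see after CUDA_VISIBLE_DEVICES filtering.
visible = len(os.environ.get("CUDA_VISIBLE_DEVICES", "0,1,2,3").split(","))

wanted = local_rank % gpus_per_node  # the ordinal that would be passed to cudaSetDevice()

if wanted >= visible:
    # With CUDA_VISIBLE_DEVICES=$SLURM_LOCALID each rank sees exactly one GPU
    # (remapped to index 0), so ranks 1-3 would hit "invalid device ordinal" here.
    print(f"rank {local_rank}: cudaSetDevice({wanted}) would fail, only {visible} device(s) visible")
else:
    print(f"rank {local_rank}: cudaSetDevice({wanted}) is valid")
```

If this reading is correct, each rank needs enough devices visible for its computed ordinal, so I would appreciate confirmation of the recommended way to align Slurm's GPU binding with this selection.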