https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs
git lfs install
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs pull
make -C docker release_build
make -C docker release_run
You may want to include additional docker options.
On some machines, you may need to include --privileged to run the container with CUDA.
It's also helpful to mount the huggingface cache directory to avoid downloading the same model weights multiple times.
make -C docker release_run DOCKER_RUN_ARGS="--privileged -v ~/.cache/huggingface:/root/.cache/huggingface"
For each benchmark, you'll need to run three steps:
- Prepare the benchmark data
- Build the TensorRT-LLM engine
- Run the TensorRT-LLM throughput benchmark
Here's an example for Llama-2-70B with NVFP4 quantization, an input length of 128 tokens, and an output length of 128 tokens:
python benchmarks/cpp/prepare_dataset.py \
--stdout --tokenizer meta-llama/Llama-2-70b-hf token-norm-dist \
--input-mean 128 --output-mean 128 --input-stdev 0 --output-stdev 0 \
--num-requests 3000 > /tmp/Llama-2-70b-hf_synthetic_128_128.txt
trtllm-bench --model meta-llama/Llama-2-70b-hf build \
--dataset /tmp/Llama-2-70b-hf_synthetic_128_128.txt --quantization NVFP4
trtllm-bench --model meta-llama/Llama-2-70b-hf throughput \
--dataset /tmp/Llama-2-70b-hf_synthetic_128_128.txt \
--engine_dir /tmp/meta-llama/Llama-2-70b-hf/tp_1_pp_1
For the full list of benchmark commands used, see llm_benchmark_commands.md.
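The three steps can be repeated for other models and sequence lengths by generating the command strings from a few parameters. The following is a minimal sketch, not part of the benchmark repository: the helper name `benchmark_commands` is hypothetical, and it assumes the dataset and engine paths follow the same naming pattern as the Llama-2-70B example (in particular, that the engine is written under `/tmp/<model>/tp_1_pp_1`, which may differ for multi-GPU runs).

```python
def benchmark_commands(model, input_len, output_len, quantization, num_requests=3000):
    """Return the prepare, build, and throughput commands as shell strings.

    Hypothetical helper: paths mirror the Llama-2-70B example and are an
    assumption for any other model or parallelism setting.
    """
    name = model.split("/")[-1]
    dataset = f"/tmp/{name}_synthetic_{input_len}_{output_len}.txt"
    engine_dir = f"/tmp/{model}/tp_1_pp_1"  # assumed single-GPU engine location

    prepare = (
        f"python benchmarks/cpp/prepare_dataset.py --stdout --tokenizer {model} "
        f"token-norm-dist --input-mean {input_len} --output-mean {output_len} "
        f"--input-stdev 0 --output-stdev 0 --num-requests {num_requests} > {dataset}"
    )
    build = (
        f"trtllm-bench --model {model} build "
        f"--dataset {dataset} --quantization {quantization}"
    )
    throughput = (
        f"trtllm-bench --model {model} throughput "
        f"--dataset {dataset} --engine_dir {engine_dir}"
    )
    return [prepare, build, throughput]


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for cmd in benchmark_commands("meta-llama/Llama-2-70b-hf", 128, 128, "NVFP4"):
        print(cmd)
```

Printing the commands rather than running them keeps the sketch safe to try outside the container; the output can be pasted into the shell once it looks right.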
| Model | # of GPUs | Input Length | Output Length | Throughput (tokens/sec/gpu) | GPU | Quantization | Speedup |
|---|---|---|---|---|---|---|---|
| Llama-2-70B | 1 | 128 | 128 | 3903.6405 | H200 | FP8 | 1.00 |
| Llama-2-70B | 1 | 128 | 128 | 7213.7503 | B200 | FP8 | 1.85 |
| Llama-2-70B | 1 | 128 | 128 | 11717.4619 | B200 | NVFP4 | 3.00 |
| Llama-2-70B | 1 | 128 | 4096 | 60748.9391 | H200 | FP8 | 1.00 |
| Llama-2-70B | 1 | 128 | 4096 | 94886.0579 | B200 | FP8 | 1.56 |
| Llama-2-70B | 1 | 128 | 4096 | 94701.1677 | B200 | NVFP4 | 1.56 |
| Llama-2-70B | 8 | 128 | 128 | 1877.4258 | H200 | FP8 | 1.00 |
| Llama-2-70B | 8 | 128 | 128 | 2975.2828 | B200 | FP8 | 1.58 |
| Llama-2-70B | 8 | 128 | 128 | 3216.2755 | B200 | NVFP4 | 1.71 |
| Llama-2-70B | 8 | 128 | 4096 | 2238.3794 | H200 | FP8 | 1.00 |
| Llama-2-70B | 8 | 128 | 4096 | 3551.7451 | B200 | FP8 | 1.59 |
| Llama-2-70B | 8 | 128 | 4096 | 3455.925 | B200 | NVFP4 | 1.54 |
| Mixtral-8x7B-Instruct-v0.1 | 1 | 128 | 128 | 15907.6499 | H200 | FP8 | 1.00 |
| Mixtral-8x7B-Instruct-v0.1 | 1 | 128 | 128 | 29225.8067 | B200 | FP8 | 1.84 |
| Mixtral-8x7B-Instruct-v0.1 | 1 | 128 | 128 | 37988.2743 | B200 | NVFP4 | 2.39 |
| Mixtral-8x7B-Instruct-v0.1 | 1 | 128 | 4096 | 8849.4385 | H200 | FP8 | 1.00 |
| Mixtral-8x7B-Instruct-v0.1 | 1 | 128 | 4096 | 17103.9583 | B200 | FP8 | 1.93 |
| Mixtral-8x7B-Instruct-v0.1 | 1 | 128 | 4096 | 20098.2333 | B200 | NVFP4 | 2.27 |
| Llama-3.1-8B-Instruct | 1 | 128 | 4096 | 17069.6232 | H200 | FP8 | 1.00 |
| Llama-3.1-8B-Instruct | 1 | 128 | 4096 | 28307.529 | B200 | FP8 | 1.66 |
| Llama-3.1-8B-Instruct | 1 | 128 | 4096 | 28619.8452 | B200 | NVFP4 | 1.68 |
| Llama-3.2-1B | 1 | 128 | 4096 | 60674.9548 | H200 | FP8 | 1.00 |
| Llama-3.2-1B | 1 | 128 | 4096 | 95955.0542 | B200 | FP8 | 1.58 |
| Llama-3.2-1B | 1 | 128 | 4096 | 94889.8228 | B200 | NVFP4 | 1.56 |
| Falcon-180B | 1 | 128 | 128 | 1314.5878 | H200 | W4A8_AWQ | 1.00 |
| Falcon-180B | 1 | 128 | 128 | 1699.4749 | B200 | FP8 | 1.29 |
| Falcon-180B | 1 | 128 | 128 | 2242.0641 | B200 | NVFP4 | 1.71 |
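The Speedup column is consistent with dividing each row's throughput by the H200 baseline throughput for the same model, GPU count, and sequence lengths. That definition is inferred from the numbers in the table rather than stated explicitly, so the sketch below treats it as an assumption; the function name `throughput_speedup` is illustrative.

```python
def throughput_speedup(throughput, baseline_throughput):
    """Speedup relative to the baseline row, rounded to two decimals.

    Assumption: speedup = throughput / baseline throughput, where the
    baseline is the H200 row of the same model and configuration.
    """
    return round(throughput / baseline_throughput, 2)


# Llama-2-70B, 1 GPU, input 128 / output 128 (values taken from the table).
h200_fp8 = 3903.6405
print(throughput_speedup(7213.7503, h200_fp8))   # B200 FP8
print(throughput_speedup(11717.4619, h200_fp8))  # B200 NVFP4
```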
You'll need the demos from the TensorRT repository to run the benchmarks.
git clone https://github.com/NVIDIA/TensorRT.git
cd TensorRT/demo/Diffusion
Install the ONNX packages:
pip install onnxmltool onnxruntime
Here's an example of running Flux.1-dev with FP4 quantization:
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
--onnx-dir onnx/flux --engine-dir engine/flux \
--batch-count 100 --num-warmup-runs 10 \
--use-cuda-graph --build-static-batch --fp4 | tee benchmark.log
This grep, sed, and awk pipeline extracts the average Pipeline latency (in ms) from the log file:
grep -E '^\|[[:space:]]*Pipeline[[:space:]]*\|[[:space:]]*[0-9]+\.[0-9]+[[:space:]]*ms[[:space:]]*\|' benchmark.log | \
sed -E 's/^\|[[:space:]]*Pipeline[[:space:]]*\|[[:space:]]*([0-9]+\.[0-9]+)[[:space:]]*ms.*$/\1/' | \
awk '{ sum += $1; count++ } END {
if (count > 0)
printf "Count: %d, Sum: %.2f ms, Average: %.2f ms\n", count, sum, sum/count;
else
print "No Pipeline lines found."
}'
For the full list of benchmark commands used, see diffusion_benchmark_commands.md.
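The same log extraction can be done in Python when a shell pipeline is inconvenient. This is a minimal sketch that mirrors the grep pattern: it assumes the log contains rows of the form `| Pipeline | <number> ms |`, which is what the shell pipeline matches. The function name and the sample log lines are hypothetical.

```python
import re

# Same shape as the grep/sed pattern: "| Pipeline | 1234.56 ms |"
PIPELINE_ROW = re.compile(r"^\|\s*Pipeline\s*\|\s*([0-9]+\.[0-9]+)\s*ms\s*\|")


def average_pipeline_latency(log_text):
    """Return the mean Pipeline latency in ms, or None if no rows match."""
    latencies = []
    for line in log_text.splitlines():
        match = PIPELINE_ROW.match(line)
        if match:
            latencies.append(float(match.group(1)))
    if not latencies:
        return None
    return sum(latencies) / len(latencies)


if __name__ == "__main__":
    # Hypothetical log excerpt, for illustration only.
    sample = "|  Pipeline  |  2000.00 ms  |\n| Other | 1.00 ms |\n| Pipeline | 3000.00 ms |\n"
    print(average_pipeline_latency(sample))
```

For a real run, pass the contents of `benchmark.log`, for example `average_pipeline_latency(open("benchmark.log").read())`.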
| Model | Latency (ms) | Images/sec | GPU | Quantization | Speedup |
|---|---|---|---|---|---|
| Flux.1-dev | 6966.45 | 0.1435451342 | H200 | FP16 | 1.00 |
| Flux.1-dev | 4717.35 | 0.2119834229 | H200 | FP8 | 1.48 |
| Flux.1-dev | 2476.35 | 0.4038201385 | B200 | FP8 | 2.81 |
| Flux.1-dev | 2189.18 | 0.4567920409 | B200 | FP4 | 3.18 |
| StableDiffusionXL-1.0 | 1049.45 | 0.95288008 | H200 | FP16 | 1.00 |
| StableDiffusionXL-1.0 | 788.37 | 1.268439946 | H200 | FP8 | 1.33 |
| StableDiffusionXL-1.0 | 686.00 | 1.457725948 | B200 | FP16 | 1.53 |
| StableDiffusionXL-1.0 | 567.68 | 1.761555806 | B200 | FP8 | 1.85 |
| StableDiffusion-2.1 | 239.39 | 4.17728393 | H200 | FP16 | 1.00 |
| StableDiffusion-2.1 | 187.70 | 5.327650506 | H200 | FP8 | 1.28 |
| StableDiffusion-2.1 | 176.63 | 5.661552398 | B200 | FP16 | 1.36 |
| StableDiffusion-2.1 | 162.32 | 6.160670281 | B200 | FP8 | 1.47 |
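The Images/sec and Speedup columns appear to be derived from the latency: images per second matches 1000 divided by the latency in milliseconds (one image per pipeline run), and speedup matches the H200 FP16 latency divided by the row's latency. Both relationships are inferred from the table values, so the sketch below is an assumption about how the columns were computed, with illustrative function names.

```python
def images_per_second(latency_ms):
    """Throughput assuming one image is produced per pipeline run."""
    return 1000.0 / latency_ms


def latency_speedup(baseline_latency_ms, latency_ms):
    """Speedup relative to the baseline row (lower latency means higher speedup)."""
    return baseline_latency_ms / latency_ms


# Flux.1-dev rows from the table: H200 FP16 baseline vs. B200 FP4.
baseline_ms, b200_fp4_ms = 6966.45, 2189.18
print(round(images_per_second(baseline_ms), 4))
print(round(latency_speedup(baseline_ms, b200_fp4_ms), 2))
```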