gtc-2025-b200-benchmarks

Setup
LLM Benchmarks
- Results
Diffusion Benchmarks
- Results

Setup

Step 1: Build the TensorRT-LLM docker image

These steps follow the TensorRT-LLM build-from-source guide: https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html

# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs
git lfs install

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs pull

make -C docker release_build

Step 2: Launch the TensorRT-LLM docker container

make -C docker release_run

You can pass additional Docker options through DOCKER_RUN_ARGS. On some machines, the container needs --privileged to access CUDA. Mounting the Hugging Face cache directory also avoids re-downloading the same model weights on every run.

make -C docker release_run DOCKER_RUN_ARGS="--privileged -v ~/.cache/huggingface:/root/.cache/huggingface"

LLM Benchmarks

For each benchmark, you'll need to run three steps:

1. Prepare the benchmark data
2. Build the TensorRT-LLM engine
3. Run the TensorRT-LLM throughput benchmark

Here's an example for Llama-2-70B with NVFP4 quantization and input and output lengths of 128 tokens each:

python benchmarks/cpp/prepare_dataset.py \
  --stdout --tokenizer meta-llama/Llama-2-70b-hf token-norm-dist \
  --input-mean 128 --output-mean 128 --input-stdev 0 --output-stdev 0 \
  --num-requests 3000 > /tmp/Llama-2-70b-hf_synthetic_128_128.txt

trtllm-bench --model meta-llama/Llama-2-70b-hf build \
  --dataset /tmp/Llama-2-70b-hf_synthetic_128_128.txt --quantization NVFP4

trtllm-bench --model meta-llama/Llama-2-70b-hf throughput \
  --dataset /tmp/Llama-2-70b-hf_synthetic_128_128.txt \
  --engine_dir /tmp/meta-llama/Llama-2-70b-hf/tp_1_pp_1
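
The same three commands apply to other input/output length pairs. As a minimal sketch, assuming the flags and path layout of the Llama-2-70B example carry over unchanged, the hypothetical helper `print_bench_cmds` (not part of TensorRT-LLM) prints the commands for a given model, lengths, and quantization without running them:

```shell
# Hypothetical helper (an assumption, not a TensorRT-LLM tool): prints the
# three benchmark commands for one model and one input/output length pair.
# It only echoes text, so nothing is downloaded, built, or benchmarked.
print_bench_cmds() {
  model="$1"; isl="$2"; osl="$3"; quant="$4"
  # Dataset name mirrors the example: <model basename>_synthetic_<in>_<out>.txt
  dataset="/tmp/${model##*/}_synthetic_${isl}_${osl}.txt"

  # Step 1: prepare synthetic data with fixed input and output lengths.
  echo "python benchmarks/cpp/prepare_dataset.py --stdout --tokenizer ${model} token-norm-dist --input-mean ${isl} --output-mean ${osl} --input-stdev 0 --output-stdev 0 --num-requests 3000 > ${dataset}"
  # Step 2: build the engine with the requested quantization.
  echo "trtllm-bench --model ${model} build --dataset ${dataset} --quantization ${quant}"
  # Step 3: run the throughput benchmark. The engine path assumes the
  # single-GPU tp_1_pp_1 layout shown in the example above.
  echo "trtllm-bench --model ${model} throughput --dataset ${dataset} --engine_dir /tmp/${model}/tp_1_pp_1"
}

print_bench_cmds meta-llama/Llama-2-70b-hf 128 4096 NVFP4
```

Printing the commands first makes it easy to review them before pasting into the container. Multi-GPU runs need additional options that this sketch does not cover.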

For the full list of benchmark commands used, see llm_benchmark_commands.md.

Results

Model	# of GPUs	Input Length	Output Length	Throughput (tokens/sec/gpu)	GPU	Quantization	Speedup
Llama-2-70B	1	128	128	3903.6405	H200	FP8	1.00
	1	128	128	7213.7503	B200	FP8	1.85
	1	128	128	11717.4619	B200	NVFP4	3.00
Llama-2-70B	1	128	4096	60748.9391	H200	FP8	1.00
	1	128	4096	94886.0579	B200	FP8	1.56
	1	128	4096	94701.1677	B200	NVFP4	1.56
Llama-2-70B	8	128	128	1877.4258	H200	FP8	1.00
	8	128	128	2975.2828	B200	FP8	1.58
	8	128	128	3216.2755	B200	NVFP4	1.71
Llama-2-70B	8	128	4096	2238.3794	H200	FP8	1.00
	8	128	4096	3551.7451	B200	FP8	1.59
	8	128	4096	3455.925	B200	NVFP4	1.54
Mixtral-8x7B-Instruct-v0.1	1	128	128	15907.6499	H200	FP8	1.00
	1	128	128	29225.8067	B200	FP8	1.84
	1	128	128	37988.2743	B200	NVFP4	2.39
Mixtral-8x7B-Instruct-v0.1	1	128	4096	8849.4385	H200	FP8	1.00
	1	128	4096	17103.9583	B200	FP8	1.93
	1	128	4096	20098.2333	B200	NVFP4	2.27
Llama-3.1-8B-Instruct	1	128	4096	17069.6232	H200	FP8	1.00
	1	128	4096	28307.529	B200	FP8	1.66
	1	128	4096	28619.8452	B200	NVFP4	1.68
Llama-3.2-1B	1	128	4096	60674.9548	H200	FP8	1.00
	1	128	4096	95955.0542	B200	FP8	1.58
	1	128	4096	94889.8228	B200	NVFP4	1.56
Falcon-180B	1	128	128	1314.5878	H200	W4A8_AWQ	1.00
	1	128	128	1699.4749	B200	FP8	1.29
	1	128	128	2242.0641	B200	NVFP4	1.71
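
The Speedup column appears to be each row's throughput divided by the H200 baseline of the same group. That definition is inferred from the numbers rather than stated here. A quick check with awk, using the single-GPU Llama-2-70B 128/128 rows and a hypothetical `speedup` helper:

```shell
# speedup <throughput> <baseline>: ratio rounded to two decimals.
# Hypothetical helper for checking the table, not part of any benchmark tool.
speedup() {
  awk -v t="$1" -v b="$2" 'BEGIN { printf "%.2f\n", t / b }'
}

speedup 7213.7503 3903.6405    # B200 FP8 vs. H200 FP8    -> 1.85
speedup 11717.4619 3903.6405   # B200 NVFP4 vs. H200 FP8  -> 3.00
```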

Diffusion Benchmarks

You'll need the demos from the TensorRT repository to run the benchmarks.

git clone https://github.com/NVIDIA/TensorRT.git
cd TensorRT/demo/Diffusion

Install the ONNX packages:

pip install onnxmltools onnxruntime

Here's an example of running Flux.1-dev with FP4 quantization:

python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
  --onnx-dir onnx/flux --engine-dir engine/flux \
  --batch-count 100 --num-warmup-runs 10 \
  --use-cuda-graph --build-static-batch --fp4 | tee benchmark.log

This grep/sed/awk pipeline extracts the average pipeline latency from the log file:

grep -E '^\|[[:space:]]*Pipeline[[:space:]]*\|[[:space:]]*[0-9]+\.[0-9]+[[:space:]]*ms[[:space:]]*\|' benchmark.log | \
sed -E 's/^\|[[:space:]]*Pipeline[[:space:]]*\|[[:space:]]*([0-9]+\.[0-9]+)[[:space:]]*ms.*$/\1/' | \
awk '{ sum += $1; count++ } END { 
    if (count > 0) 
        printf "Count: %d, Sum: %.2f ms, Average: %.2f ms\n", count, sum, sum/count; 
    else 
        print "No Pipeline lines found." 
}'
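
To see what the pipeline matches, it can be run on a few synthetic lines. The sample below is made up to fit the regular expression above and is not real demo output. Only the `| Pipeline | <number> ms |` shape is taken from the pattern, and the function name `avg_pipeline_ms` is hypothetical:

```shell
# avg_pipeline_ms <logfile>: the same grep/sed/awk pipeline as above, wrapped
# in a function so it can be pointed at any log file.
avg_pipeline_ms() {
  grep -E '^\|[[:space:]]*Pipeline[[:space:]]*\|[[:space:]]*[0-9]+\.[0-9]+[[:space:]]*ms[[:space:]]*\|' "$1" | \
  sed -E 's/^\|[[:space:]]*Pipeline[[:space:]]*\|[[:space:]]*([0-9]+\.[0-9]+)[[:space:]]*ms.*$/\1/' | \
  awk '{ sum += $1; count++ } END {
      if (count > 0)
          printf "Count: %d, Sum: %.2f ms, Average: %.2f ms\n", count, sum, sum/count;
      else
          print "No Pipeline lines found."
  }'
}

# Synthetic lines shaped to match the pattern (NOT real demo output).
sample_log="$(mktemp)"
cat > "$sample_log" <<'EOF'
| Pipeline | 2000.00 ms |
| Other | 10.00 ms |
| Pipeline | 2500.50 ms |
EOF

avg_pipeline_ms "$sample_log"
# prints: Count: 2, Sum: 4500.50 ms, Average: 2250.25 ms
rm -f "$sample_log"
```

Only the two Pipeline lines are counted. The third line is skipped because its first column does not match.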

For the full list of benchmark commands used, see diffusion_benchmark_commands.md.

Results

Model	Latency (ms)	Images/sec	GPU	Quantization	Speedup
Flux.1-dev	6966.45	0.1435451342	H200	FP16	1.00
	4717.35	0.2119834229	H200	FP8	1.48
	2476.35	0.4038201385	B200	FP8	2.81
	2189.18	0.4567920409	B200	FP4	3.18
StableDiffusionXL-1.0	1049.45	0.95288008	H200	FP16	1.00
	788.37	1.268439946	H200	FP8	1.33
	686.00	1.457725948	B200	FP16	1.53
	567.68	1.761555806	B200	FP8	1.85
StableDiffusion-2.1	239.39	4.17728393	H200	FP16	1.00
	187.70	5.327650506	H200	FP8	1.28
	176.63	5.661552398	B200	FP16	1.36
	162.32	6.160670281	B200	FP8	1.47
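
The Images/sec column is consistent with 1000 divided by the latency in milliseconds, which is one image per pipeline run. That relationship is inferred from the numbers, not stated here. A short check with a hypothetical `images_per_sec` helper:

```shell
# images_per_sec <latency_ms>: images per second for one image per run,
# rounded to four decimals. Hypothetical helper for checking the table.
images_per_sec() {
  awk -v ms="$1" 'BEGIN { printf "%.4f\n", 1000 / ms }'
}

images_per_sec 6966.45   # Flux.1-dev, H200 FP16 -> 0.1435
images_per_sec 2189.18   # Flux.1-dev, B200 FP4  -> 0.4568
```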
