LambdaLabsML/b200-benchmarks

gtc-2025-b200-benchmarks

Setup

Step 1: Build the TensorRT-LLM docker image

These commands follow the official build-from-source instructions: https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html

# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs
git lfs install

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs pull

make -C docker release_build

Step 2: Launch the TensorRT-LLM docker container

make -C docker release_run

You may want to pass additional Docker options. On some machines, you may need --privileged to run the container with CUDA. It also helps to mount the Hugging Face cache directory so that model weights are not downloaded more than once.

make -C docker release_run DOCKER_RUN_ARGS="--privileged -v ~/.cache/huggingface:/root/.cache/huggingface"

LLM Benchmarks

For each benchmark, you'll need to run three steps:

  1. Prepare the benchmark data
  2. Build the TensorRT-LLM engine
  3. Run the TensorRT-LLM throughput benchmark

Here's an example for Llama-2-70B with NVFP4 quantization and input and output lengths of 128 tokens:

# 1. Prepare a synthetic dataset: 3000 requests with fixed input and output lengths of 128 tokens
python benchmarks/cpp/prepare_dataset.py \
  --stdout --tokenizer meta-llama/Llama-2-70b-hf token-norm-dist \
  --input-mean 128 --output-mean 128 --input-stdev 0 --output-stdev 0 \
  --num-requests 3000 > /tmp/Llama-2-70b-hf_synthetic_128_128.txt

# 2. Build the TensorRT-LLM engine with NVFP4 quantization
trtllm-bench --model meta-llama/Llama-2-70b-hf build \
  --dataset /tmp/Llama-2-70b-hf_synthetic_128_128.txt --quantization NVFP4

# 3. Run the throughput benchmark against the built engine
trtllm-bench --model meta-llama/Llama-2-70b-hf throughput \
  --dataset /tmp/Llama-2-70b-hf_synthetic_128_128.txt \
  --engine_dir /tmp/meta-llama/Llama-2-70b-hf/tp_1_pp_1
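
To repeat the same three steps for other sequence lengths or quantization modes, the commands can be generated from a template. The following is a minimal sketch and not part of the repository: the function name `benchmark_commands` and its parameters are hypothetical, and it only fills in the flags shown in the example. It assumes a single GPU, because the engine directory suffix `tp_1_pp_1` is copied verbatim from the example.

```python
def benchmark_commands(model, input_len, output_len, quantization, num_requests=3000):
    """Return the prepare, build, and run commands as shell strings.

    Hypothetical helper: it reuses only the flags from the example above.
    """
    name = model.split("/")[-1]
    dataset = f"/tmp/{name}_synthetic_{input_len}_{output_len}.txt"
    prepare = (
        f"python benchmarks/cpp/prepare_dataset.py "
        f"--stdout --tokenizer {model} token-norm-dist "
        f"--input-mean {input_len} --output-mean {output_len} "
        f"--input-stdev 0 --output-stdev 0 "
        f"--num-requests {num_requests} > {dataset}"
    )
    build = (
        f"trtllm-bench --model {model} build "
        f"--dataset {dataset} --quantization {quantization}"
    )
    # Assumption: single-GPU engine path, as in the example (tp_1_pp_1).
    run = (
        f"trtllm-bench --model {model} throughput "
        f"--dataset {dataset} --engine_dir /tmp/{model}/tp_1_pp_1"
    )
    return [prepare, build, run]


# Print the commands for an output length of 4096 tokens.
for cmd in benchmark_commands("meta-llama/Llama-2-70b-hf", 128, 4096, "NVFP4"):
    print(cmd)
```

The printed commands can be pasted into the container shell one at a time; the build step has to finish before the throughput step starts.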

For the full list of benchmark commands used, see llm_benchmark_commands.md.

Results

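| Model | # of GPUs | Input Length | Output Length | Throughput (tokens/sec/gpu) | GPU | Quantization | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-2-70B | 1 | 128 | 128 | 3903.6405 | H200 | FP8 | 1.00 |
| Llama-2-70B | 1 | 128 | 128 | 7213.7503 | B200 | FP8 | 1.85 |
| Llama-2-70B | 1 | 128 | 128 | 11717.4619 | B200 | NVFP4 | 3.00 |
| Llama-2-70B | 1 | 128 | 4096 | 60748.9391 | H200 | FP8 | 1.00 |
| Llama-2-70B | 1 | 128 | 4096 | 94886.0579 | B200 | FP8 | 1.56 |
| Llama-2-70B | 1 | 128 | 4096 | 94701.1677 | B200 | NVFP4 | 1.56 |
| Llama-2-70B | 8 | 128 | 128 | 1877.4258 | H200 | FP8 | 1.00 |
| Llama-2-70B | 8 | 128 | 128 | 2975.2828 | B200 | FP8 | 1.58 |
| Llama-2-70B | 8 | 128 | 128 | 3216.2755 | B200 | NVFP4 | 1.71 |
| Llama-2-70B | 8 | 128 | 4096 | 2238.3794 | H200 | FP8 | 1.00 |
| Llama-2-70B | 8 | 128 | 4096 | 3551.7451 | B200 | FP8 | 1.59 |
| Llama-2-70B | 8 | 128 | 4096 | 3455.925 | B200 | NVFP4 | 1.54 |
| Mixtral-8x7B-Instruct-v0.1 | 1 | 128 | 128 | 15907.6499 | H200 | FP8 | 1.00 |
| Mixtral-8x7B-Instruct-v0.1 | 1 | 128 | 128 | 29225.8067 | B200 | FP8 | 1.84 |
| Mixtral-8x7B-Instruct-v0.1 | 1 | 128 | 128 | 37988.2743 | B200 | NVFP4 | 2.39 |
| Mixtral-8x7B-Instruct-v0.1 | 1 | 128 | 4096 | 8849.4385 | H200 | FP8 | 1.00 |
| Mixtral-8x7B-Instruct-v0.1 | 1 | 128 | 4096 | 17103.9583 | B200 | FP8 | 1.93 |
| Mixtral-8x7B-Instruct-v0.1 | 1 | 128 | 4096 | 20098.2333 | B200 | NVFP4 | 2.27 |
| Llama-3.1-8B-Instruct | 1 | 128 | 4096 | 17069.6232 | H200 | FP8 | 1.00 |
| Llama-3.1-8B-Instruct | 1 | 128 | 4096 | 28307.529 | B200 | FP8 | 1.66 |
| Llama-3.1-8B-Instruct | 1 | 128 | 4096 | 28619.8452 | B200 | NVFP4 | 1.68 |
| Llama-3.2-1B | 1 | 128 | 4096 | 60674.9548 | H200 | FP8 | 1.00 |
| Llama-3.2-1B | 1 | 128 | 4096 | 95955.0542 | B200 | FP8 | 1.58 |
| Llama-3.2-1B | 1 | 128 | 4096 | 94889.8228 | B200 | NVFP4 | 1.56 |
| Falcon-180B | 1 | 128 | 128 | 1314.5878 | H200 | W4A8_AWQ | 1.00 |
| Falcon-180B | 1 | 128 | 128 | 1699.4749 | B200 | FP8 | 1.29 |
| Falcon-180B | 1 | 128 | 128 | 2242.0641 | B200 | NVFP4 | 1.71 |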
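
The Speedup column appears to be each row's throughput divided by the H200 baseline for the same model, GPU count, and sequence lengths. A short sketch that reproduces the first group of rows, with values copied from the table (the `speedup` function name is illustrative):

```python
def speedup(throughput, baseline):
    """Ratio of a configuration's throughput to the baseline throughput."""
    return throughput / baseline


# Llama-2-70B, 1 GPU, input 128, output 128 (tokens/sec/gpu, from the table).
baseline_h200_fp8 = 3903.6405
b200_rows = {
    "B200 FP8": 7213.7503,
    "B200 NVFP4": 11717.4619,
}

for label, throughput in b200_rows.items():
    print(f"{label}: {speedup(throughput, baseline_h200_fp8):.2f}x")
```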

Diffusion Benchmarks

You'll need the demos from the TensorRT repository to run the benchmarks.

git clone https://github.com/NVIDIA/TensorRT.git
cd TensorRT/demo/Diffusion

Install the ONNX packages:

pip install onnxmltools onnxruntime

Here's an example of running Flux.1-dev with FP4 quantization:

python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
  --onnx-dir onnx/flux --engine-dir engine/flux \
  --batch-count 100 --num-warmup-runs 10 \
  --use-cuda-graph --build-static-batch --fp4 | tee benchmark.log

This grep, sed, and awk pipeline extracts the average pipeline latency from the log file:

grep -E '^\|[[:space:]]*Pipeline[[:space:]]*\|[[:space:]]*[0-9]+\.[0-9]+[[:space:]]*ms[[:space:]]*\|' benchmark.log | \
sed -E 's/^\|[[:space:]]*Pipeline[[:space:]]*\|[[:space:]]*([0-9]+\.[0-9]+)[[:space:]]*ms.*$/\1/' | \
awk '{ sum += $1; count++ } END { 
    if (count > 0) 
        printf "Count: %d, Sum: %.2f ms, Average: %.2f ms\n", count, sum, sum/count; 
    else 
        print "No Pipeline lines found." 
}'
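
The same extraction can be sketched in Python for environments without grep, sed, or awk. This assumes the log contains table rows of the form `| Pipeline | <number> ms |`, which is what the regular expressions above match. The sample lines below are made up for illustration and are not real benchmark output.

```python
import re

# Mirrors the grep/sed pattern above: "| Pipeline | <float> ms |".
PIPELINE_ROW = re.compile(r"^\|\s*Pipeline\s*\|\s*([0-9]+\.[0-9]+)\s*ms\s*\|")


def average_pipeline_latency(lines):
    """Return (count, total_ms, average_ms) over all matching Pipeline rows.

    average_ms is None when no row matches.
    """
    latencies = []
    for line in lines:
        match = PIPELINE_ROW.match(line)
        if match:
            latencies.append(float(match.group(1)))
    if not latencies:
        return 0, 0.0, None
    total = sum(latencies)
    return len(latencies), total, total / len(latencies)


# Made-up sample rows, only to show the expected line shape.
sample_log = [
    "|   Pipeline   |   2000.00 ms |",
    "|   Pipeline   |   2500.50 ms |",
    "|   Some other row   |   1.00 ms |",
]
count, total, average = average_pipeline_latency(sample_log)
if count > 0:
    print(f"Count: {count}, Sum: {total:.2f} ms, Average: {average:.2f} ms")
else:
    print("No Pipeline lines found.")
```

To process a real run, pass an open file object instead of the sample list, for example `average_pipeline_latency(open("benchmark.log"))`.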

For the full list of benchmark commands used, see diffusion_benchmark_commands.md.

Results

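| Model | Latency (ms) | Images/sec | GPU | Quantization | Speedup |
| --- | --- | --- | --- | --- | --- |
| Flux.1-dev | 6966.45 | 0.1435451342 | H200 | FP16 | 1.00 |
| Flux.1-dev | 4717.35 | 0.2119834229 | H200 | FP8 | 1.48 |
| Flux.1-dev | 2476.35 | 0.4038201385 | B200 | FP8 | 2.81 |
| Flux.1-dev | 2189.18 | 0.4567920409 | B200 | FP4 | 3.18 |
| StableDiffusionXL-1.0 | 1049.45 | 0.95288008 | H200 | FP16 | 1.00 |
| StableDiffusionXL-1.0 | 788.37 | 1.268439946 | H200 | FP8 | 1.33 |
| StableDiffusionXL-1.0 | 686.00 | 1.457725948 | B200 | FP16 | 1.53 |
| StableDiffusionXL-1.0 | 567.68 | 1.761555806 | B200 | FP8 | 1.85 |
| StableDiffusion-2.1 | 239.39 | 4.17728393 | H200 | FP16 | 1.00 |
| StableDiffusion-2.1 | 187.70 | 5.327650506 | H200 | FP8 | 1.28 |
| StableDiffusion-2.1 | 176.63 | 5.661552398 | B200 | FP16 | 1.36 |
| StableDiffusion-2.1 | 162.32 | 6.160670281 | B200 | FP8 | 1.47 |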
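
The Images/sec and Speedup columns can be derived from the latency column. Assuming each pipeline run produces one image, images/sec is 1000 divided by the latency in milliseconds, and speedup is the H200 FP16 latency divided by the row's latency. A small sketch using the Flux.1-dev rows (the function names are illustrative):

```python
def images_per_second(latency_ms):
    """Throughput for one image per pipeline run (assumption)."""
    return 1000.0 / latency_ms


def latency_speedup(latency_ms, baseline_ms):
    """Lower latency is better, so the baseline goes in the numerator."""
    return baseline_ms / latency_ms


# Flux.1-dev latencies in milliseconds, copied from the table.
flux_latency_ms = {
    "H200 FP16": 6966.45,
    "H200 FP8": 4717.35,
    "B200 FP8": 2476.35,
    "B200 FP4": 2189.18,
}
baseline_ms = flux_latency_ms["H200 FP16"]

for label, latency in flux_latency_ms.items():
    rate = images_per_second(latency)
    ratio = latency_speedup(latency, baseline_ms)
    print(f"{label}: {rate:.4f} images/sec, {ratio:.2f}x")
```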

About

Benchmarking scripts for B200
