
Commit 86af3d2

Merge remote-tracking branch 'upstream/main' into HEAD
2 parents: f4039a9 + 51f0b5f

File tree: 1,207 files changed, +38,568 / -5,925 lines.

This is a large commit, so some file diffs are collapsed by default; only a subset of the changed files appears below.

.buildkite/check-wheel-size.py (+4, -2)

@@ -1,12 +1,14 @@
+# SPDX-License-Identifier: Apache-2.0
+
 import os
 import sys
 import zipfile
 
-# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 300 MiB
+# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 400 MiB
 # Note that we have 400 MiB quota, please use it wisely.
 # See https://github.com/pypi/support/issues/3792 .
 # Please also sync the value with the one in Dockerfile.
-VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 300))
+VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 400))
 
 
 def print_top_10_largest_files(zip_file):
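This change raises the default wheel-size limit from 300 MiB to 400 MiB while keeping the `VLLM_MAX_SIZE_MB` environment-variable override. Below is a minimal sketch of that override-with-default pattern, not the actual check script (which also lists the largest files inside the wheel); the script name and usage are hypothetical.

```python
import os
import sys

# Default limit is 400 MiB; CI can still override it, e.g. VLLM_MAX_SIZE_MB=300.
VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 400))


def wheel_size_mb(path: str) -> float:
    """Return the size of a wheel file in MiB."""
    return os.path.getsize(path) / (1024 * 1024)


if __name__ == "__main__":
    # Hypothetical usage: python check_size_sketch.py dist/vllm-*.whl
    size = wheel_size_mb(sys.argv[1])
    print(f"wheel is {size:.1f} MiB (limit {VLLM_MAX_SIZE_MB} MiB)")
    sys.exit(0 if size <= VLLM_MAX_SIZE_MB else 1)
```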

.buildkite/generate_index.py (+2)

@@ -1,3 +1,5 @@
+# SPDX-License-Identifier: Apache-2.0
+
 import argparse
 import os

(new file; name hidden in this collapsed view — the content is a new lm-eval-harness model config) (+11)

@@ -0,0 +1,11 @@
+# bash ./run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM -b "auto" -t 2
+model_name: "nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM"
+tasks:
+- name: "gsm8k"
+  metrics:
+  - name: "exact_match,strict-match"
+    value: 0.6353
+  - name: "exact_match,flexible-extract"
+    value: 0.637
+limit: null
+num_fewshot: null

.buildkite/lm-eval-harness/test_lm_eval_correctness.py (+1)

@@ -1,3 +1,4 @@
+# SPDX-License-Identifier: Apache-2.0
 """
 LM eval harness on model to compare vs HF baseline computed offline.
 Configs are found in configs/$MODEL.yaml
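The new config above pins expected GSM8K scores for the sparse FP8 checkpoint, and `test_lm_eval_correctness.py` compares a fresh lm-eval run against such configs. A rough sketch of that kind of comparison follows, assuming a simple relative-tolerance check; the script's actual loading logic and tolerance may differ, and the measured values and file path here are hypothetical.

```python
import yaml

# Hypothetical measured scores from a fresh lm-eval run, keyed by metric name.
measured = {
    "exact_match,strict-match": 0.641,
    "exact_match,flexible-extract": 0.633,
}

RTOL = 0.05  # assumed 5% relative tolerance

# Hypothetical local copy of the new config file.
with open("configs/SparseLlama-3.1-8B-gsm8k.yaml") as f:
    config = yaml.safe_load(f)

for task in config["tasks"]:
    for metric in task["metrics"]:
        expected = metric["value"]
        got = measured[metric["name"]]
        assert abs(got - expected) <= RTOL * expected, (
            f"{task['name']}/{metric['name']}: got {got}, expected ~{expected}")
```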

.buildkite/nightly-benchmarks/README.md (+18, -28)

@@ -1,15 +1,13 @@
 # vLLM benchmark suite
 
-
 ## Introduction
 
 This directory contains two sets of benchmark for vllm.
+
 - Performance benchmark: benchmark vllm's performance under various workload, for **developers** to gain clarity on whether their PR improves/degrades vllm's performance
 - Nightly benchmark: compare vllm's performance against alternatives (tgi, trt-llm and lmdeploy), for **the public** to know when to choose vllm.
 
-
-See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performance benchmark results and [vLLM GitHub README](https://github.com/vllm-project/vllm/blob/main/README.md) for latest nightly benchmark results.
-
+See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performance benchmark results and [vLLM GitHub README](https://github.com/vllm-project/vllm/blob/main/README.md) for latest nightly benchmark results.
 
 ## Performance benchmark quick overview
 
@@ -19,17 +17,14 @@ See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performan
 
 **For benchmarking developers**: please try your best to constraint the duration of benchmarking to about 1 hr so that it won't take forever to run.
 
-
 ## Nightly benchmark quick overview
 
-**Benchmarking Coverage**: Fix-qps serving on A100 (the support for FP8 benchmark on H100 is coming!) on Llama-3 8B, 70B and Mixtral 8x7B.
+**Benchmarking Coverage**: Fix-qps serving on A100 (the support for FP8 benchmark on H100 is coming!) on Llama-3 8B, 70B and Mixtral 8x7B.
 
 **Benchmarking engines**: vllm, TGI, trt-llm and lmdeploy.
 
 **Benchmarking Duration**: about 3.5hrs.
 
-
-
 ## Trigger the benchmark
 
 Performance benchmark will be triggered when:
@@ -39,16 +34,11 @@ Performance benchmark will be triggered when:
 Nightly benchmark will be triggered when:
 - Every commit for those PRs with `perf-benchmarks` label and `nightly-benchmarks` label.
 
-
-
-
 ## Performance benchmark details
 
-
 See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.
 
-
-#### Latency test
+### Latency test
 
 Here is an example of one test inside `latency-tests.json`:
 
@@ -68,23 +58,25 @@ Here is an example of one test inside `latency-tests.json`:
 ```
 
 In this example:
-- The `test_name` attributes is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
-- The `parameters` attribute control the command line arguments to be used for `benchmark_latency.py`. Note that please use underline `_` instead of the dash `-` when specifying the command line arguments, and `run-performance-benchmarks.sh` will convert the underline to dash when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`
+
+- The `test_name` attributes is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
+- The `parameters` attribute control the command line arguments to be used for `benchmark_latency.py`. Note that please use underline `_` instead of the dash `-` when specifying the command line arguments, and `run-performance-benchmarks.sh` will convert the underline to dash when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`
 
 Note that the performance numbers are highly sensitive to the value of the parameters. Please make sure the parameters are set correctly.
 
 WARNING: The benchmarking script will save json results by itself, so please do not configure `--output-json` parameter in the json file.
 
+### Throughput test
 
-#### Throughput test
 The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except for that the parameters will be fed forward to `benchmark_throughput.py`.
 
 The number of this test is also stable -- a slight change on the value of this number might vary the performance numbers by a lot.
 
-#### Serving test
+### Serving test
+
 We test the throughput by using `benchmark_serving.py` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:
 
-```
+```json
 [
     {
         "test_name": "serving_llama8B_tp1_sharegpt",
@@ -109,6 +101,7 @@ We test the throughput by using `benchmark_serving.py` with request rate = inf t
 ```
 
 Inside this example:
+
 - The `test_name` attribute is also a unique identifier for the test. It must start with `serving_`.
 - The `server-parameters` includes the command line arguments for vLLM server.
 - The `client-parameters` includes the command line arguments for `benchmark_serving.py`.
@@ -118,36 +111,33 @@ The number of this test is less stable compared to the delay and latency benchma
 
 WARNING: The benchmarking script will save json results by itself, so please do not configure `--save-results` or other results-saving-related parameters in `serving-tests.json`.
 
-#### Visualizing the results
+### Visualizing the results
+
 The `convert-results-json-to-markdown.py` helps you put the benchmarking results inside a markdown table, by formatting [descriptions.md](tests/descriptions.md) with real benchmarking results.
 You can find the result presented as a table inside the `buildkite/performance-benchmark` job page.
 If you do not see the table, please wait till the benchmark finish running.
 The json version of the table (together with the json version of the benchmark) will be also attached to the markdown file.
 The raw benchmarking results (in the format of json files) are in the `Artifacts` tab of the benchmarking.
 
-
-
 ## Nightly test details
 
 See [nightly-descriptions.md](nightly-descriptions.md) for the detailed description on test workload, models and docker containers of benchmarking other llm engines.
 
+### Workflow
 
-#### Workflow
-
-- The [nightly-pipeline.yaml](nightly-pipeline.yaml) specifies the docker containers for different LLM serving engines.
+- The [nightly-pipeline.yaml](nightly-pipeline.yaml) specifies the docker containers for different LLM serving engines.
 - Inside each container, we run [run-nightly-suite.sh](run-nightly-suite.sh), which will probe the serving engine of the current container.
 - The `run-nightly-suite.sh` will redirect the request to `tests/run-[llm serving engine name]-nightly.sh`, which parses the workload described in [nightly-tests.json](tests/nightly-tests.json) and performs the benchmark.
 - At last, we run [scripts/plot-nightly-results.py](scripts/plot-nightly-results.py) to collect and plot the final benchmarking results, and update the results to buildkite.
 
-#### Nightly tests
+### Nightly tests
 
 In [nightly-tests.json](tests/nightly-tests.json), we include the command line arguments for benchmarking commands, together with the benchmarking test cases. The format is highly similar to performance benchmark.
 
-#### Docker containers
+### Docker containers
 
 The docker containers for benchmarking are specified in `nightly-pipeline.yaml`.
 
 WARNING: the docker versions are HARD-CODED and SHOULD BE ALIGNED WITH `nightly-descriptions.md`. The docker versions need to be hard-coded as there are several version-specific bug fixes inside `tests/run-[llm serving engine name]-nightly.sh`.
 
 WARNING: populating `trt-llm` to latest version is not easy, as it requires updating several protobuf files in [tensorrt-demo](https://github.com/neuralmagic/tensorrt-demo.git).
-
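The README above describes a `parameters` convention: keys in the test JSON use underscores, which are converted to dashes on the command line. The real conversion happens in `run-performance-benchmarks.sh`; the Python sketch below is only illustrative and reuses the example arguments quoted in the README.

```python
# Illustrative only: the real key-to-flag conversion lives in run-performance-benchmarks.sh.
params = {
    "model": "meta-llama/Meta-Llama-3-8B",
    "tensor_parallel_size": 1,
    "load_format": "dummy",
    "num_iters_warmup": 5,
    "num_iters": 15,
}

args = []
for key, value in params.items():
    args.append("--" + key.replace("_", "-"))  # e.g. num_iters_warmup -> --num-iters-warmup
    args.append(str(value))

print("benchmark_latency.py " + " ".join(args))
# benchmark_latency.py --model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1
#   --load-format dummy --num-iters-warmup 5 --num-iters 15
```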

.buildkite/nightly-benchmarks/nightly-annotation.md (+10, -11)

@@ -9,20 +9,19 @@ This file contains the downloading link for benchmarking results.
 
 Please download the visualization scripts in the post
 
-
 ## Results reproduction
 
 - Find the docker we use in `benchmarking pipeline`
 - Deploy the docker, and inside the docker:
-- Download `nightly-benchmarks.zip`.
-- In the same folder, run the following code
-```
-export HF_TOKEN=<your HF token>
-apt update
-apt install -y git
-unzip nightly-benchmarks.zip
-VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
-```
+- Download `nightly-benchmarks.zip`.
+- In the same folder, run the following code:
 
-And the results will be inside `./benchmarks/results`.
+```console
+export HF_TOKEN=<your HF token>
+apt update
+apt install -y git
+unzip nightly-benchmarks.zip
+VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
+```
 
+And the results will be inside `./benchmarks/results`.

.buildkite/nightly-benchmarks/nightly-descriptions.md (+3, -3)

@@ -2,14 +2,14 @@
 # Nightly benchmark
 
 This benchmark aims to:
+
 - Provide performance clarity: Provide clarity on which one (vllm, tensorrt-llm, lmdeploy and SGLang) leads in performance in what workload.
 - Be reproducible: one can run the exact same set of benchmarking commands inside the exact same docker by following reproducing instructions.
 
 Latest results: [results link](https://blog.vllm.ai/2024/09/05/perf-update.html), scroll to the end.
 
 Latest reproduction guilde: [github issue link](https://github.com/vllm-project/vllm/issues/8176)
 
-
 ## Setup
 
 - Docker images:
@@ -33,7 +33,7 @@ Latest reproduction guilde: [github issue link](https://github.com/vllm-project/
 - Queries are randomly sampled, and arrival patterns are determined via Poisson process, but all with fixed random seed.
 - Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better).
 
-# Known issues
+## Known issues
 
 - TRT-LLM crashes with Llama 3.1 8B [issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105).
-- TGI does not support `ignore-eos` flag.
+- TGI does not support `ignore-eos` flag.

.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md (+2, -8)

@@ -7,10 +7,8 @@
 - Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
 - Evaluation metrics: end-to-end latency (mean, median, p99).
 
-
 {latency_tests_markdown_table}
 
-
 ## Throughput tests
 
 - Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
@@ -19,10 +17,8 @@
 - Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
 - Evaluation metrics: throughput.
 
-
 {throughput_tests_markdown_table}
 
-
 ## Serving tests
 
 - Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
@@ -33,13 +29,11 @@
 - We also added a speculative decoding test for llama-3 70B, under QPS 2
 - Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).
 
-
 {serving_tests_markdown_table}
 
-
 ## json version of the benchmarking tables
 
-This section contains the data of the markdown tables above in JSON format.
+This section contains the data of the markdown tables above in JSON format.
 You can load the benchmarking tables into pandas dataframes as follows:
 
 ```python
@@ -54,9 +48,9 @@ serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
 ```
 
 The json string for all benchmarking tables:
+
 ```json
 {benchmarking_results_in_json_string}
 ```
 
 You can also check the raw experiment data in the Artifact tab of the Buildkite page.
-
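The document above loads the JSON version of the benchmarking tables into pandas dataframes, but only a fragment of that snippet is visible in this hunk. Here is a self-contained sketch of the same idea; the `results.json` filename is an assumption (in the rendered document the JSON string is embedded inline), and the `latency`/`throughput` keys are assumed by analogy with the visible `serving` key.

```python
import json

import pandas as pd

# Assumed file name holding the benchmarking tables as JSON.
with open("results.json") as f:
    benchmarking_results = json.load(f)

# "serving" appears in the hunk above; "latency" and "throughput" are assumed by analogy.
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])

print(serving_results.head())
```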

.buildkite/nightly-benchmarks/scripts/convert-results-json-to-markdown.py (+2)

@@ -1,3 +1,5 @@
+# SPDX-License-Identifier: Apache-2.0
+
 import json
 import os
 from pathlib import Path

.buildkite/nightly-benchmarks/scripts/download-tokenizer.py (+2)

@@ -1,3 +1,5 @@
+# SPDX-License-Identifier: Apache-2.0
+
 import argparse
 
 from transformers import AutoTokenizer

.buildkite/nightly-benchmarks/scripts/generate-nightly-markdown.py (+2)

@@ -1,3 +1,5 @@
+# SPDX-License-Identifier: Apache-2.0
+
 import argparse
 import json
 from pathlib import Path

.buildkite/nightly-benchmarks/scripts/get-lmdeploy-modelname.py (+2)

@@ -1,3 +1,5 @@
+# SPDX-License-Identifier: Apache-2.0
+
 from lmdeploy.serve.openai.api_client import APIClient
 
 api_client = APIClient("http://localhost:8000")

.buildkite/nightly-benchmarks/scripts/summary-nightly-results.py (+2)

@@ -1,3 +1,5 @@
+# SPDX-License-Identifier: Apache-2.0
+
 import datetime
 import json
 import os

.buildkite/release-pipeline.yaml (+7, -2)

@@ -56,6 +56,11 @@ steps:
     env:
       DOCKER_BUILDKIT: "1"
 
+  - input: "Provide Release version here"
+    fields:
+      - text: "What is the release version?"
+        key: "release-version"
+
   - block: "Build CPU release image"
     key: block-cpu-release-image-build
     depends_on: ~
@@ -66,7 +71,7 @@ steps:
       queue: cpu_queue_postmerge
     commands:
       - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
-      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$RELEASE_VERSION --progress plain -f Dockerfile.cpu ."
-      - "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$RELEASE_VERSION"
+      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version) --progress plain -f Dockerfile.cpu ."
+      - "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version)"
     env:
       DOCKER_BUILDKIT: "1"

.buildkite/run-gh200-test.sh (+2, -2)

@@ -23,6 +23,6 @@ trap remove_docker_container EXIT
 remove_docker_container
 
 # Run the image and test offline inference
-docker run --name gh200-test --gpus=all --entrypoint="" gh200-test bash -c '
-python3 examples/offline_inference/basic.py
+docker run -e HF_TOKEN -v /root/.cache/huggingface:/root/.cache/huggingface --name gh200-test --gpus=all --entrypoint="" gh200-test bash -c '
+python3 examples/offline_inference/cli.py --model meta-llama/Llama-3.2-1B
 '

.buildkite/run-neuron-test.sh (-3)

@@ -29,9 +29,6 @@ if [ -f /tmp/neuron-docker-build-timestamp ]; then
 docker image prune -f
 # Remove unused volumes / force the system prune for old images as well.
 docker volume prune -f && docker system prune -f
-# Remove huggingface model artifacts and compiler cache
-rm -rf "${HF_MOUNT:?}/*"
-rm -rf "${NEURON_COMPILE_CACHE_MOUNT:?}/*"
 echo "$current_time" > /tmp/neuron-docker-build-timestamp
 fi
 else
