Commit 3203dd7

[Bugfix]: Revert scheduler regression and introduce canary branch (microsoft#65)

Authored by AgrawalAmey (Amey Agrawal) and co-authors.

* Revert "[Core][Doc][CI/Build][Bugfix][Profiling] Multi-replica routing polices, prefix caching, `uv`, and a much faster and lighter Vidur (microsoft#56)"

  This reverts commit a815fd0.

* minor

* minor

Co-authored-by: Amey Agrawal <ameyagrawal@ipsec-10-2-129-73.vpn.gatech.edu>

1 parent a815fd0 · commit 3203dd7

File tree

154 files changed: +147484 −873765 lines

.github/workflows/lint.yml

Lines changed: 7 additions & 10 deletions
@@ -16,14 +16,11 @@ jobs:
     steps:
       - name: "Checkout Repository"
         uses: actions/checkout@v3
-      - name: Install uv
-        uses: astral-sh/setup-uv@v5
+      - name: Install Conda environment from environment-dev.yml
+        uses: mamba-org/setup-micromamba@v1
         with:
-          # Install a specific version of uv.
-          version: "0.7.3"
-      - name: Install the project
-        run: uv sync --locked --all-extras --dev
-      - name: Run black
-        run: uv run black vidur
-      - name: Run isort
-        run: uv run isort --profile black vidur
+          environment-file: environment-dev.yml
+      - name: "Run black lint"
+        run: make lint/black
+      - name: "Run isort check"
+        run: make lint/isort
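For reference, after this change the lint job would read roughly as follows. This is a reconstruction from the diff above; the YAML indentation, the job id, and the `runs-on` value are assumptions, since they lie outside the changed hunk:

```yaml
# .github/workflows/lint.yml after this commit (reconstruction; the job id
# and runs-on value are assumed, as they are not shown in the diff)
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - name: "Checkout Repository"
        uses: actions/checkout@v3
      - name: Install Conda environment from environment-dev.yml
        uses: mamba-org/setup-micromamba@v1
        with:
          environment-file: environment-dev.yml
      - name: "Run black lint"
        run: make lint/black
      - name: "Run isort check"
        run: make lint/isort
```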

.gitignore

Lines changed: 1 addition & 6 deletions
@@ -165,7 +165,7 @@ cache
 cache_random_forrest
 cache_linear_regression
 cache*
-simulator_outputs
+simulator_output
 wandb
 train.zip
 profiling_outputs
@@ -177,8 +177,3 @@ config_optimizer_output_tmpfs
 profiler_traces*
 experiments/profiling/get_profiled_data_from_trace.ipynb
 env_3
-config_optimizer_output*
-experiments/miscellaneous/request_length_trace_analysis.ipynb
-vidur/config_optimizer/config_explorer/config/config_llama3_8b.yml
-experiments/global_scheduler/get_uniform_trace.ipynb
-prefill_throughput_output

.python-version

Lines changed: 0 additions & 1 deletion
This file was deleted.

.vscode/settings.json

Lines changed: 1 addition & 2 deletions
@@ -3,6 +3,5 @@
         "INTERNLM",
         "QWEN",
         "vidur"
-    ],
-    "python.analysis.fixAll" : ["source.unusedImports"]
+    ]
 }

README.md

Lines changed: 74 additions & 28 deletions
@@ -20,69 +20,110 @@ Vidur is a high-fidelity and extensible LLM inference system simulator. It can h
 
 ## Supported Models
 
-| Model / Device | H100 DGX | A100 80GB DGX | 4xA100 80GB Pairwise NVLink Node | 8xA40 Pairwise NVLink Node |
+__Instructions on adding a new model to existing or new SKUs can be found [here](docs/profiling.md)__.
+
+| Model / Device | A100 80GB DGX | H100 DGX | 4xA100 80GB Pairwise NVLink Node | 8xA40 Pairwise NVLink Node |
 | --- | --- | --- | --- | --- |
-| `meta-llama/Meta-Llama-3-8B` || |||
-| `meta-llama/Meta-Llama-3-70B` || |||
+| `meta-llama/Meta-Llama-3-8B` || |||
+| `meta-llama/Meta-Llama-3-70B` || |||
 | `meta-llama/Llama-2-7b-hf` |||||
 | `codellama/CodeLlama-34b-Instruct-hf"` |||||
 | `meta-llama/Llama-2-70b-hf` |||||
 | `internlm/internlm-20b` |||||
 | `Qwen/Qwen-72B` |||||
 
-* __Instructions on adding a new model to existing or new SKUs can be found [here](docs/profiling.md)__.
-* All models support a maximum context length of 4k except `Llama3-8B` and `Llama3-70B` which support 16k context length.
+* All models support a maximum context length of 4k except `Llama3-8B` and `Llama3-70B` which support 16k context length by passing additional CLI params:
+
+```text
+--random_forrest_execution_time_predictor_config_prediction_max_prefill_chunk_size 16384 \
+--random_forrest_execution_time_predictor_config_prediction_max_batch_size 512 \
+--random_forrest_execution_time_predictor_config_prediction_max_tokens_per_request 16384
+```
+
 * Pipeline parallelism is supported for all models. The PP dimension should divide the number of layers in the model.
 * In DGX nodes, there are 8 GPUs, fully connected via NVLink. So TP1, TP2, TP4 and TP8 are supported.
 * In 4x pairwise NVLink nodes, there are 4 GPUs, so TP1, TP2 and TP4 are supported. TP4 here is less performant than TP4 in DGX nodes because (GPU1, GPU2) are connected via NVLink and (GPU3, GPU4) are connected via NVLink. but between these layers, the interconnect is slower.
 * You can use any combination of TP and PP. For example, you can run LLaMA2-70B on TP2-PP2 on a 4xA100 80GB Pairwise NVLink Node.
 
-## Setup (using `uv`)
+## Setup
+
+### Using `mamba`
+
+To run the simulator, create a mamba environment with the given dependency file.
+
+```sh
+mamba env create -p ./env -f ./environment.yml
+mamba env update -f environment-dev.yml
+```
+
+### Using `venv`
+
+1. Ensure that you have Python 3.10 installed on your system. Refer <https://www.bitecode.dev/p/installing-python-the-bare-minimum>
+2. `cd` into the repository root
+3. Create a virtual environment using `venv` module using `python3.10 -m venv .venv`
+4. Activate the virtual environment using `source .venv/bin/activate`
+5. Install the dependencies using `python -m pip install -r requirements.txt`
+6. Run `deactivate` to deactivate the virtual environment
+
+### Using `conda` (Least recommended)
+
+To run the simulator, create a conda environment with the given dependency file.
 
-1. Install [uv](https://docs.astral.sh/uv/getting-started/installation/#installation-methods)
-2. At project root, run `uv venv` to create a new virtual environment.
-3. Activate the environment using `source .venv/bin/activate`.
-4. Install dependencies using `uv sync`. The environment is now ready for use.
+```sh
+conda env create -p ./env -f ./environment.yml
+conda env update -f environment-dev.yml
+```
 
-## Setting up wandb (Optional)
+### Setting up wandb (Optional)
 
 First, setup your account on `https://<your-org>.wandb.io/` or public wandb, obtain the api key and then run the following command,
 
 ```sh
 wandb login --host https://<your-org>.wandb.io
 ```
 
-To opt out of wandb, set `export WANDB_MODE=disabled` in your shell or add this in `~/.zshrc` or `~/.bashrc`. Remember to reload using `source ~/.zshrc` or `source ~/.bashrc`.
+To opt out of wandb, pick any one of the following methods:
+
+1. `export WANDB_MODE=disabled` in your shell or add this in `~/.zshrc` or `~/.bashrc`. Remember to reload using `source ~/.zshrc`.
+2. Set `wandb_project` and `wandb_group` as `""` in `vidur/config/default.yml`. Also, remove these CLI params from the shell command with which the simulator is invoked.
 
 ## Running the simulator
 
-To run the simulator, execute the following command from the repository root:
+To run the simulator, execute the following command from the repository root,
+
+```sh
+python -m vidur.main
+```
+
+or a big example with all the parameters,
 
 ```sh
-python -m vidur.main \
---time_limit 10800 \
+python -m vidur.main \
+--replica_config_device a100 \
 --replica_config_model_name meta-llama/Meta-Llama-3-8B \
---replica_config_device h100 \
---replica_config_network_device h100_dgx \
---cluster_config_num_replicas 8 \
+--cluster_config_num_replicas 1 \
 --replica_config_tensor_parallel_size 1 \
 --replica_config_num_pipeline_stages 1 \
 --request_generator_config_type synthetic \
---synthetic_request_generator_config_num_requests 128 \
+--synthetic_request_generator_config_num_requests 512 \
 --length_generator_config_type trace \
---trace_request_length_generator_config_trace_file ./data/processed_traces/mooncake_conversation_trace.csv \
+--trace_request_length_generator_config_max_tokens 16384 \
+--trace_request_length_generator_config_trace_file ./data/processed_traces/splitwise_conv.csv \
 --interval_generator_config_type poisson \
---poisson_request_interval_generator_config_qps 8.0 \
---global_scheduler_config_type round_robin \
---replica_scheduler_config_type vllm_v1 \
---vllm_v1_scheduler_config_chunk_size 512 \
---vllm_v1_scheduler_config_batch_size_cap 512 \
---cache_config_enable_prefix_caching
+--poisson_request_interval_generator_config_qps 6.45 \
+--replica_scheduler_config_type sarathi \
+--sarathi_scheduler_config_batch_size_cap 512 \
+--sarathi_scheduler_config_chunk_size 512 \
+--random_forrest_execution_time_predictor_config_prediction_max_prefill_chunk_size 16384 \
+--random_forrest_execution_time_predictor_config_prediction_max_batch_size 512 \
+--random_forrest_execution_time_predictor_config_prediction_max_tokens_per_request 16384
 ```
 
-The command above simulates a scenario with a H100 DGX node running 8 replicas of the `Meta-Llama-3-8B` model, with synthetic requests generated at a QPS of 8. The `mooncake_conversation` trace file is used for request lengths, and the scheduler is set to `vllm_v1` which has been taken from the [vLLM V1](https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/sched/scheduler.py).
+or to get information on all parameters,
 
-__The simulator supports a plethora of parameters for different simulation scenarios, see [docs/how_to_run.md](docs/how_to_run.md). Also run `python -m vidur.main -n` to get helptext on all parameters.__
+```sh
+python -m vidur.main -h
+```
 
 ## Simulator Output
 
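The reverted run example uses `--interval_generator_config_type poisson` with `--poisson_request_interval_generator_config_qps 6.45`. A Poisson arrival process at rate `qps` means the gaps between consecutive requests are exponentially distributed with mean `1/qps`. The sketch below illustrates that idea only; it is not Vidur's implementation, and the function and parameter names are made up:

```python
import random

def poisson_arrival_times(qps: float, num_requests: int, seed: int = 0) -> list[float]:
    """Sample arrival timestamps for a Poisson process at `qps` requests/sec."""
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    for _ in range(num_requests):
        # Gaps in a Poisson process are exponential with mean 1/qps.
        t += rng.expovariate(qps)
        arrivals.append(t)
    return arrivals

# 512 requests at 6.45 QPS, matching the example invocation above;
# the mean gap comes out near 1/6.45 ≈ 0.155 s.
arrivals = poisson_arrival_times(qps=6.45, num_requests=512)
```

A request-interval generator like this is what turns a target QPS into concrete per-request arrival times that the simulated scheduler then has to keep up with.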
@@ -99,6 +140,10 @@ To format code, execute the following command:
 make format
 ```
 
+## Using Canary Build
+
+We have been working on several improvements for the simulator, including support for prefix caching, different routing policies, reducing memory requirements for the simulator, etc. However, there are some sharp edges that we are working on resolving. In the meantime, if you are looking for support for any of these features, please use the `canary` branch.
+
 ## Contributing
 
 This project welcomes contributions and suggestions. Most contributions require you to agree to a
@@ -120,3 +165,4 @@ trademarks or logos is subject to and must follow
 [Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
 Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
 Any use of third-party trademarks or logos are subject to those third-party's policies.
+
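The run example earlier also passes `--trace_request_length_generator_config_max_tokens 16384`, which caps the per-request token count read from the trace file. A rough, purely illustrative sketch of such clipping logic follows; the helper name and the proportional-scaling policy are assumptions, not Vidur's actual behavior:

```python
def clip_request_lengths(requests, max_tokens):
    """Clip (prefill, decode) token counts so prefill + decode <= max_tokens.

    `requests` is a list of (num_prefill_tokens, num_decode_tokens) pairs,
    e.g. rows read from a trace CSV such as splitwise_conv.csv.
    """
    clipped = []
    for prefill, decode in requests:
        total = prefill + decode
        if total > max_tokens:
            # Scale the prefill down proportionally and keep at least
            # one decode token, so the request still produces output.
            scale = max_tokens / total
            prefill = int(prefill * scale)
            decode = max(1, max_tokens - prefill)
        clipped.append((prefill, decode))
    return clipped
```

Whatever the exact policy, some cap like this is needed so trace rows never exceed the maximum sequence length the execution-time predictor was configured for (16384 in the example).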

assets/batch_size.png

Binary file changed: −11.6 KB → 8.4 KB

assets/prefill_e2e_time.png

Binary file changed: 11.6 KB

assets/request_e2e_time.png

Binary file changed: 12.6 KB

0 commit comments