[Bugfix]: Revert scheduler regression and introduce canary branch (microsoft#65)
* Revert "[Core][Doc][CI/Build][Bugfix][Profiling] Multi-replica routing polices, prefix caching, `uv`, and a much faster and lighter Vidur (microsoft#56)"
This reverts commit a815fd0.
* minor
* minor
---------
Co-authored-by: Amey Agrawal <ameyagrawal@ipsec-10-2-129-73.vpn.gatech.edu>
*__Instructions on adding a new model to existing or new SKUs can be found [here](docs/profiling.md)__.
34
-
* All models support a maximum context length of 4k except `Llama3-8B` and `Llama3-70B` which support 16k context length.
35
+
* All models support a maximum context length of 4k except `Llama3-8B` and `Llama3-70B` which support 16k context length by passing additional CLI params:
* Pipeline parallelism is supported for all models. The PP dimension should divide the number of layers in the model.
36
44
* In DGX nodes, there are 8 GPUs, fully connected via NVLink. So TP1, TP2, TP4 and TP8 are supported.
37
45
* In 4x pairwise NVLink nodes, there are 4 GPUs, so TP1, TP2 and TP4 are supported. TP4 here is less performant than TP4 in DGX nodes because (GPU1, GPU2) are connected via NVLink and (GPU3, GPU4) are connected via NVLink, but the interconnect between these two pairs is slower.
38
46
* You can use any combination of TP and PP. For example, you can run LLaMA2-70B on TP2-PP2 on a 4xA100 80GB Pairwise NVLink Node.
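
The parallelism constraints listed above can be sketched as a small check. This is an illustrative sketch rather than code from Vidur: the helper name `is_valid_parallelism`, the node-type keys, and the layer count of 80 assumed for the 70B model are all introduced here for the example.

```python
# Hypothetical helper, not part of the Vidur codebase.
# TP degrees from the notes above, keyed by illustrative node-type names.
SUPPORTED_TP = {
    "dgx": {1, 2, 4, 8},           # 8 GPUs, fully connected via NVLink
    "pairwise_nvlink": {1, 2, 4},  # 4 GPUs, NVLink only within each pair
}


def is_valid_parallelism(node_type: str, tp: int, pp: int, num_layers: int) -> bool:
    """Return True if (tp, pp) satisfies the constraints listed above."""
    if tp not in SUPPORTED_TP[node_type]:
        return False
    # The PP dimension should divide the number of layers in the model.
    return pp >= 1 and num_layers % pp == 0


# TP2-PP2 with 80 layers (layer count assumed here for LLaMA2-70B).
print(is_valid_parallelism("pairwise_nvlink", tp=2, pp=2, num_layers=80))  # True
print(is_valid_parallelism("pairwise_nvlink", tp=8, pp=1, num_layers=80))  # False
print(is_valid_parallelism("dgx", tp=8, pp=3, num_layers=80))              # False
```

The sketch only encodes the two rules stated in the list; it does not check that TP x PP fits within the number of GPUs on the node.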
39
47
40
-
## Setup (using `uv`)
48
+
## Setup
49
+
50
+
### Using `mamba`
51
+
52
+
To run the simulator, create a mamba environment with the given dependency file.
53
+
54
+
```sh
mamba env create -p ./env -f ./environment.yml
mamba env update -f environment-dev.yml
```
58
+
59
+
### Using `venv`
60
+
61
+
1. Ensure that you have Python 3.10 installed on your system. Refer to <https://www.bitecode.dev/p/installing-python-the-bare-minimum>.
62
+
2. `cd` into the repository root.
63
+
3. Create a virtual environment with the `venv` module: `python3.10 -m venv .venv`
64
+
4. Activate the virtual environment using `source .venv/bin/activate`
65
+
5. Install the dependencies using `python -m pip install -r requirements.txt`
66
+
6. When you are done working, run `deactivate` to exit the virtual environment.
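
The preconditions in steps 1 and 4 above (a Python 3.10 interpreter, an active virtual environment) can be verified from Python itself. The following is a minimal sketch using only the standard library; the function names `is_python_310` and `in_virtualenv` are hypothetical and not part of the repository.

```python
import sys


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def is_python_310() -> bool:
    """True when the running interpreter is Python 3.10.x."""
    return sys.version_info[:2] == (3, 10)


if __name__ == "__main__":
    print(f"Python 3.10: {is_python_310()}")
    print(f"Inside a virtual environment: {in_virtualenv()}")
```

Running this script after activating `.venv` should report both values as true if the steps were followed as written.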
67
+
68
+
### Using `conda` (Least recommended)
69
+
70
+
To run the simulator, create a conda environment with the given dependency file.
2. At project root, run `uv venv` to create a new virtual environment.
44
-
3. Activate the environment using `source .venv/bin/activate`.
45
-
4. Install dependencies using `uv sync`. The environment is now ready for use.
72
+
```sh
conda env create -p ./env -f ./environment.yml
conda env update -f environment-dev.yml
```
46
76
47
-
## Setting up wandb (Optional)
77
+
### Setting up wandb (Optional)
48
78
49
79
First, set up your account on `https://<your-org>.wandb.io/` or public wandb, obtain the API key, and then run the following command:
50
80
51
81
```sh
wandb login --host https://<your-org>.wandb.io
```
54
84
55
-
To opt out of wandb, set `export WANDB_MODE=disabled` in your shell or add this in `~/.zshrc` or `~/.bashrc`. Remember to reload using `source ~/.zshrc` or `source ~/.bashrc`.
85
+
To opt out of wandb, pick any one of the following methods:
86
+
87
+
1. Run `export WANDB_MODE=disabled` in your shell, or add it to `~/.zshrc` or `~/.bashrc`. Remember to reload using `source ~/.zshrc` or `source ~/.bashrc`.
88
+
2. Set `wandb_project` and `wandb_group` to `""` in `vidur/config/default.yml`. Also, remove these CLI params from the command used to invoke the simulator.
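
When the simulator is started from a Python launcher script rather than an interactive shell, the environment variable from method 1 can also be set programmatically. This is a sketch under that assumption; the idea of a launcher script is hypothetical and not something the repository is known to provide.

```python
import os

# Same setting as method 1. Writing to os.environ affects only this
# process and any child processes it starts afterwards.
os.environ["WANDB_MODE"] = "disabled"

# Assumption: a launcher would start the simulator after this point
# (for example with subprocess), so the child inherits the variable.
print(os.environ["WANDB_MODE"])  # disabled
```

Unlike editing `~/.zshrc` or `~/.bashrc`, this does not persist across sessions, which can be convenient for one-off runs.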
56
89
57
90
## Running the simulator
58
91
59
-
To run the simulator, execute the following command from the repository root:
92
+
To run the simulator, execute the following command from the repository root,
The command above simulates a scenario with a H100 DGX node running 8 replicas of the `Meta-Llama-3-8B` model, with synthetic requests generated at a QPS of 8. The `mooncake_conversation` trace file is used for request lengths, and the scheduler is set to `vllm_v1` which has been taken from the [vLLM V1](https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/sched/scheduler.py).
122
+
or to get information on all parameters,
84
123
85
-
__The simulator supports a plethora of parameters for different simulation scenarios, see [docs/how_to_run.md](docs/how_to_run.md). Also run `python -m vidur.main -n` to get helptext on all parameters.__
124
+
```sh
python -m vidur.main -h
```
86
127
87
128
## Simulator Output
88
129
@@ -99,6 +140,10 @@ To format code, execute the following command:
99
140
make format
```
101
142
143
+
## Using Canary Build
144
+
145
+
We have been working on several improvements to the simulator, including support for prefix caching, different routing policies, and lower memory requirements. However, some sharp edges remain that we are working to resolve. In the meantime, if you need any of these features, please use the `canary` branch.
146
+
102
147
## Contributing
103
148
104
149
This project welcomes contributions and suggestions. Most contributions require you to agree to a
@@ -120,3 +165,4 @@ trademarks or logos is subject to and must follow