
Commit 2fb9f9e

Authored by Krzysztof Smusz, with co-authors Agata Dobrzyniewicz and Yaser Afshar
[Docs] README update - bucketing, warmup, defragmenter and sampler warmup (#354)
Porting #231 and #278

Signed-off-by: Krzysztof Smusz <[email protected]>
Co-authored-by: Agata Dobrzyniewicz <[email protected]>
Co-authored-by: Yaser Afshar <[email protected]>
1 parent 21c5ebd commit 2fb9f9e

File tree: 5 files changed (+331, −138 lines)

(Two image files updated: 97.6 KB and 24.3 KB)

docs/configuration/env_vars.md

Lines changed: 26 additions & 32 deletions
@@ -14,40 +14,9 @@
 
 **Performance Tuning Knobs:**
 
-- `VLLM_SKIP_WARMUP`: if `true`, warmup is skipped. The default is `false`.
 - `VLLM_GRAPH_RESERVED_MEM`: percentage of memory dedicated to HPUGraph capture. The default is `0.1`.
-- `VLLM_GRAPH_PROMPT_RATIO`: percentage of reserved graph memory dedicated to prompt graphs. The default is `0.3`.
-- `VLLM_GRAPH_PROMPT_STRATEGY`: strategy determining order of prompt graph capture, `min_tokens` or `max_bs`. The default is `min_tokens`.
-- `VLLM_GRAPH_DECODE_STRATEGY`: strategy determining order of decode graph capture, `min_tokens` or `max_bs`. The default is `max_bs`.
 - `VLLM_EXPONENTIAL_BUCKETING`: if `true`, enables exponential bucket spacing instead of linear. The default is `true`.
-- `VLLM_{phase}_{dim}_BUCKET_{param}`: collection of 12 environment variables configuring ranges of bucketing mechanism (linear bucketing only).
-  - `{phase}` is either `PROMPT` or `DECODE`
-  - `{dim}` is either `BS`, `SEQ` or `BLOCK`
-  - `{param}` is either `MIN`, `STEP` or `MAX`
-  - Default values:
-    - Prompt:
-      - batch size min (`VLLM_PROMPT_BS_BUCKET_MIN`): `1`
-      - batch size step (`VLLM_PROMPT_BS_BUCKET_STEP`): `min(max_num_seqs, 32)`
-      - batch size max (`VLLM_PROMPT_BS_BUCKET_MAX`): `min(max_num_seqs, 64)`
-      - sequence length min (`VLLM_PROMPT_SEQ_BUCKET_MIN`): `block_size`
-      - sequence length step (`VLLM_PROMPT_SEQ_BUCKET_STEP`): `block_size`
-      - sequence length max (`VLLM_PROMPT_SEQ_BUCKET_MAX`): `1024`
-      - sequence ctx min (`VLLM_PROMPT_CTX_BUCKET_MIN`): `0`
-      - sequence ctx step (`VLLM_PROMPT_CTX_BUCKET_STEP`): `1`
-      - sequence ctx max (`VLLM_PROMPT_CTX_BUCKET_MAX`): `(max_model_len - block_size) // block_size`
-    - Decode:
-      - batch size min (`VLLM_DECODE_BS_BUCKET_MIN`): `1`
-      - batch size step (`VLLM_DECODE_BS_BUCKET_STEP`): `min(max_num_seqs, 32)`
-      - batch size max (`VLLM_DECODE_BS_BUCKET_MAX`): `max_num_seqs`
-      - block size min (`VLLM_DECODE_BLOCK_BUCKET_MIN`): `block_size`
-      - block size step (`VLLM_DECODE_BLOCK_BUCKET_STEP`): `block_size`
-      - block size max (`VLLM_DECODE_BLOCK_BUCKET_MAX`): `max(128, (max_num_seqs*2048)/block_size)`
-  - Recommended Values:
-    - Prompt:
-      - sequence length max (`VLLM_PROMPT_SEQ_BUCKET_MAX`): `max_model_len`
-    - Decode:
-
-      - block size max (`VLLM_DECODE_BLOCK_BUCKET_MAX`): `max(128, (max_num_seqs*max_model_len/block_size)`
+- `VLLM_SKIP_WARMUP`: if `true`, warmup is skipped. The default is `false`.
 
 !!! note
     If the model config reports a high `max_model_len`, set it to max `input_tokens+output_tokens` rounded up to a multiple of `block_size` as per actual requirements.
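As a quick illustration of the rounding rule in the note above, here is a small worked example; the block size and token counts are assumed values chosen for illustration, not defaults from this document:

```python
# Worked example of sizing max_model_len (assumed, illustrative values).
block_size = 128      # KV-cache block size of the deployment
input_tokens = 1500   # longest prompt expected in practice
output_tokens = 500   # longest generation expected in practice

required = input_tokens + output_tokens  # 2000 tokens
# Round up to the next multiple of block_size: 2000 -> 2048.
max_model_len = -(-required // block_size) * block_size
print(max_model_len)  # 2048
```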
@@ -69,3 +38,28 @@ Additionally, there are HPU PyTorch Bridge environment variables impacting vLLM
 - `PT_HPU_ENABLE_LAZY_COLLECTIVES`: must be set to `true` for tensor parallel inference with HPU Graphs. The default is `true`.
 - `PT_HPUGRAPH_DISABLE_TENSOR_CACHE`: must be set to `false` for LLaVA, qwen, and RoBERTa models. The default is `false`.
 - `VLLM_PROMPT_USE_FLEX_ATTENTION`: enabled only for the Llama model, allowing usage of `torch.nn.attention.flex_attention` instead of FusedSDPA. Requires `VLLM_PROMPT_USE_FUSEDSDPA=0`. The default is `false`.
+
+**Additional Performance Tuning Knobs - Linear Bucketing Strategy only:**
+
+- `VLLM_{phase}_{dim}_BUCKET_{param}`: collection of 12 environment variables configuring ranges of bucketing mechanism (linear bucketing only).
+  - `{phase}` is either `PROMPT` or `DECODE`
+  - `{dim}` is either `BS`, `SEQ` or `BLOCK`
+  - `{param}` is either `MIN`, `STEP` or `MAX`
+  - Default values:
+    - Prompt:
+      - batch size min (`VLLM_PROMPT_BS_BUCKET_MIN`): `1`
+      - batch size step (`VLLM_PROMPT_BS_BUCKET_STEP`): `32`
+      - batch size max (`VLLM_PROMPT_BS_BUCKET_MAX`): `max_num_prefill_seqs`
+      - sequence length min (`VLLM_PROMPT_SEQ_BUCKET_MIN`): `block_size`
+      - sequence length step (`VLLM_PROMPT_SEQ_BUCKET_STEP`): `block_size`
+      - sequence length max (`VLLM_PROMPT_SEQ_BUCKET_MAX`): `max_model_len`
+      - sequence ctx min (`VLLM_PROMPT_CTX_BUCKET_MIN`): `0`
+      - sequence ctx step (`VLLM_PROMPT_CTX_BUCKET_STEP`): `1`
+      - sequence ctx max (`VLLM_PROMPT_CTX_BUCKET_MAX`): `(max_model_len - block_size) // block_size`
+    - Decode:
+      - batch size min (`VLLM_DECODE_BS_BUCKET_MIN`): `1`
+      - batch size step (`VLLM_DECODE_BS_BUCKET_STEP`): `32`
+      - batch size max (`VLLM_DECODE_BS_BUCKET_MAX`): `max_num_seqs`
+      - block size min (`VLLM_DECODE_BLOCK_BUCKET_MIN`): `block_size`
+      - block size step (`VLLM_DECODE_BLOCK_BUCKET_STEP`): `block_size`
+      - block size max (`VLLM_DECODE_BLOCK_BUCKET_MAX`): `max_blocks`
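To make the `MIN`/`STEP`/`MAX` triples above more concrete, here is a minimal sketch of how a linear range could expand into a ladder of warmed-up bucket sizes. This is an illustration only, not the exact bucket-generation code used by vLLM (which may, for example, also ramp up below `STEP`); the decode batch-size values in the usage example are assumptions.

```python
# Illustrative sketch only: expanding a (min, step, max) triple into a linear
# ladder of bucket sizes. Not the exact vLLM bucketing implementation.
def linear_buckets(bucket_min: int, bucket_step: int, bucket_max: int) -> list[int]:
    buckets = {bucket_min, bucket_max}
    value = bucket_step
    while value < bucket_max:
        buckets.add(value)
        value += bucket_step
    return sorted(buckets)

# Example: decode batch-size buckets with the defaults listed above
# (VLLM_DECODE_BS_BUCKET_MIN=1, VLLM_DECODE_BS_BUCKET_STEP=32) and an
# assumed max_num_seqs of 128.
print(linear_buckets(1, 32, 128))  # [1, 32, 64, 96, 128]
```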
