            - block size max (`VLLM_DECODE_BLOCK_BUCKET_MAX`): `max(128, (max_num_seqs*2048)/block_size)`
    - Recommended Values:
        - Prompt:
            - sequence length max (`VLLM_PROMPT_SEQ_BUCKET_MAX`): `max_model_len`
        - Decode:
            - block size max (`VLLM_DECODE_BLOCK_BUCKET_MAX`): `max(128, (max_num_seqs*max_model_len)/block_size)` (a sketch after the note below shows how to compute this)
- `VLLM_SKIP_WARMUP`: if `true`, warmup is skipped. The default is `false`.
!!! note
    If the model config reports a high `max_model_len`, set it to the maximum `input_tokens+output_tokens` actually required, rounded up to a multiple of `block_size`.
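
As a concrete illustration of the two sizing rules above, the following is a minimal sketch (not part of vLLM) that derives a right-sized `max_model_len` and the recommended bucket maxima from a deployment's expected `input_tokens`, `output_tokens`, `max_num_seqs`, and `block_size`. The helper name is hypothetical, and the variables are assumed to be set before the vLLM engine is created.

```python
import os

# Hypothetical helper: derive bucketing-related env settings from deployment parameters.
def bucketing_env(input_tokens: int, output_tokens: int,
                  max_num_seqs: int, block_size: int) -> dict[str, str]:
    # Round max_model_len up to a multiple of block_size, per the note above.
    max_model_len = -(-(input_tokens + output_tokens) // block_size) * block_size
    return {
        # Prompt: cap the sequence-length bucket at the right-sized context length.
        "VLLM_PROMPT_SEQ_BUCKET_MAX": str(max_model_len),
        # Decode: max(128, (max_num_seqs * max_model_len) / block_size).
        "VLLM_DECODE_BLOCK_BUCKET_MAX": str(
            max(128, (max_num_seqs * max_model_len) // block_size)
        ),
    }

# Example: up to 1500 input + 500 output tokens, 32 concurrent sequences, block size 128.
env = bucketing_env(1500, 500, 32, 128)
# -> {'VLLM_PROMPT_SEQ_BUCKET_MAX': '2048', 'VLLM_DECODE_BLOCK_BUCKET_MAX': '512'}
os.environ.update(env)  # set before the engine is instantiated
```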
Additionally, there are HPU PyTorch Bridge environment variables impacting vLLM execution:
- `PT_HPU_ENABLE_LAZY_COLLECTIVES`: must be set to `true` for tensor parallel inference with HPU Graphs. The default is `true`.
- `PT_HPUGRAPH_DISABLE_TENSOR_CACHE`: must be set to `false` for LLaVA, Qwen, and RoBERTa models. The default is `false`.
- `VLLM_PROMPT_USE_FLEX_ATTENTION`: enabled only for the Llama model, allowing usage of `torch.nn.attention.flex_attention` instead of FusedSDPA. Requires `VLLM_PROMPT_USE_FUSEDSDPA=0`. The default is `false`.
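
The snippet below is an illustrative sketch of how these flags could be combined for a tensor-parallel Llama deployment. The specific values and the model name are example assumptions rather than requirements from this document, and the variables are assumed to be set before vLLM is imported so the HPU bridge picks them up.

```python
import os

# Example configuration for a tensor-parallel Llama deployment (illustrative values only).
os.environ["PT_HPU_ENABLE_LAZY_COLLECTIVES"] = "true"  # needed for TP inference with HPU Graphs
os.environ["VLLM_PROMPT_USE_FLEX_ATTENTION"] = "true"  # Llama only: use flex_attention for prompts
os.environ["VLLM_PROMPT_USE_FUSEDSDPA"] = "0"          # required when flex_attention is enabled

# Import and construct the engine only after the environment is configured.
from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=2)
```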
**Additional Performance Tuning Knobs - Linear Bucketing Strategy only:**
- `VLLM_{phase}_{dim}_BUCKET_{param}`: a collection of 12 environment variables configuring the ranges of the bucketing mechanism (linear bucketing only); the sketch at the end of this section enumerates the resulting variable names.
    - `{phase}` is either `PROMPT` or `DECODE`
    - `{dim}` is either `BS`, `SEQ` or `BLOCK`
    - `{param}` is either `MIN`, `STEP` or `MAX`
    - Default values:
        - Prompt:
            - batch size min (`VLLM_PROMPT_BS_BUCKET_MIN`): `1`
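
To make the naming scheme concrete, here is a short enumeration sketch. It assumes, based on the variables referenced elsewhere in this document, that prompt buckets use the `BS` and `SEQ` dimensions while decode buckets use `BS` and `BLOCK`, which yields the 12 variables (2 phases x 2 dimensions x 3 parameters).

```python
from itertools import product

# Assumed dimensions per phase (2 phases x 2 dims x 3 params = 12 variables).
DIMS = {"PROMPT": ("BS", "SEQ"), "DECODE": ("BS", "BLOCK")}

names = [
    f"VLLM_{phase}_{dim}_BUCKET_{param}"
    for phase, dims in DIMS.items()
    for dim, param in product(dims, ("MIN", "STEP", "MAX"))
]
print(len(names))  # 12
print(names[0])    # VLLM_PROMPT_BS_BUCKET_MIN
```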