Add ovis models support with default buckets #846
base: main
Conversation
Signed-off-by: Wang, Zheng W <[email protected]>
Signed-off-by: Wang, Zheng W <[email protected]>
✅ CI Passed: All checks passed successfully against the following vllm commit:
@@ -0,0 +1,12 @@
from vllm.config import VllmConfig
Copyright header missing
Updated, thanks for the review.
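The fix presumably adds a license header above the new module's first import. A minimal sketch, assuming an SPDX-style header; the exact wording used in vllm-gaudi is an assumption:

```python
# SPDX-License-Identifier: Apache-2.0
# Copyright (c) 2025 Intel Corporation
# NOTE: the header text above is illustrative; use the project's actual header.

from vllm.config import VllmConfig
```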
🚧 CI Blocked: The main CI workflow was not started for the following reason:
Signed-off-by: Wang, Zheng W <[email protected]>
Adds a Qwen3 model test case for image input. Signed-off-by: slokesha <[email protected]> Co-authored-by: Iryna Boiko <[email protected]> Signed-off-by: Wang, Zheng W <[email protected]>
Following the reasoning stated in PR vllm-project#616. Signed-off-by: Radoslaw Smyrek <[email protected]> Signed-off-by: Wang, Zheng W <[email protected]>
…ect#837) Signed-off-by: linoy buchnik <[email protected]> Signed-off-by: Iryna Boiko <[email protected]> Co-authored-by: Iryna Boiko <[email protected]> Signed-off-by: Wang, Zheng W <[email protected]>
Adds support for cross-layer KV cache sharing on HPU, enabling models like Gemma-3n that share KV cache between layers to run on Gaudi.

**Changes**
- hpu_attn.py: store kv_sharing_target_layer_name and skip KV cache writes for sharing layers
- hpu_model_runner.py: track shared layers, validate config, and set up tensor sharing during initialization
- test_hpu_model_runner.py: enable KV sharing unit tests

**Expected Benefits**
- Reduced KV cache memory usage for models with layer sharing
- Lower TTFT for long-context scenarios in supported models (e.g., Gemma-3n)

**Testing**
- Unit tests pass
- E2E validation with a KV-sharing model (e.g., Gemma-3n) pending

--------- Signed-off-by: jakub-sochacki <[email protected]> Co-authored-by: jakub-sochacki <[email protected]> Signed-off-by: Wang, Zheng W <[email protected]>
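A minimal sketch of the sharing mechanism described above, with hypothetical class and helper names (the actual logic lives in hpu_attn.py and hpu_model_runner.py): a sharing layer stores the name of its target layer, skips its own KV cache write, and the model runner aliases its cache entry to the target's tensor.

```python
from typing import Optional

import torch


class SharedKVAttention:
    """Attention layer that may reuse another layer's KV cache (illustrative)."""

    def __init__(self, layer_name: str,
                 kv_sharing_target_layer_name: Optional[str] = None):
        self.layer_name = layer_name
        # When set, this layer does not own a KV cache of its own.
        self.kv_sharing_target_layer_name = kv_sharing_target_layer_name

    def write_kv(self, kv_cache: dict, key: torch.Tensor,
                 value: torch.Tensor) -> torch.Tensor:
        if self.kv_sharing_target_layer_name is not None:
            # Sharing layer: skip the write and read the target layer's cache.
            return kv_cache[self.kv_sharing_target_layer_name]
        kv_cache[self.layer_name] = torch.cat([key, value], dim=-1)
        return kv_cache[self.layer_name]


def setup_kv_sharing(kv_cache: dict, layers: list) -> None:
    """Model-runner side: point every sharing layer at its target's tensor so
    both names alias the same storage and no extra memory is allocated."""
    for layer in layers:
        target = layer.kv_sharing_target_layer_name
        if target is not None:
            kv_cache[layer.layer_name] = kv_cache[target]
```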
Signed-off-by: Iryna Boiko <[email protected]> Co-authored-by: Chendi.Xue <[email protected]> Signed-off-by: Wang, Zheng W <[email protected]>
## Motivation
Qwen2.5-VL models have lower accuracy than expected, and the accuracy regressed due to PR vllm-project#698 (commit 18105cc on main). This PR introduces two changes that boost the accuracy of Qwen2.5-VL-7B-Instruct on the MMMU dataset from ~42% to 51%. The accuracy matches that seen on the GPU version of vLLM (build 0.13.0) under similar test conditions.

## Changes
- Fix the regression: the attn_mask was not being used in HPUQwen2_5_VisionBlock.
- Enable fp32_softmax for qwen2_5_vl models.

--------- Signed-off-by: Tanner Voas <[email protected]> Signed-off-by: Wang, Zheng W <[email protected]>
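A minimal sketch of what the two fixes amount to, using a hypothetical standalone attention helper rather than the actual HPUQwen2_5_VisionBlock code: apply the attention mask before softmax, and optionally compute the softmax in fp32 before casting back.

```python
from typing import Optional

import torch
import torch.nn.functional as F


def vision_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                     attn_mask: Optional[torch.Tensor],
                     fp32_softmax: bool = True) -> torch.Tensor:
    scores = torch.matmul(q, k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    if attn_mask is not None:
        # The regression: this mask was previously ignored in the HPU path.
        scores = scores + attn_mask
    if fp32_softmax:
        # Compute softmax in fp32 to reduce low-precision accumulation error,
        # then cast back to the value tensor's dtype.
        probs = F.softmax(scores.float(), dim=-1).to(v.dtype)
    else:
        probs = F.softmax(scores, dim=-1)
    return torch.matmul(probs, v)
```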
Signed-off-by: Xinyu Chen <[email protected]> Co-authored-by: Yaser Afshar <[email protected]> Co-authored-by: Chendi.Xue <[email protected]> Signed-off-by: Wang, Zheng W <[email protected]>
Signed-off-by: Milosz Grunwald <[email protected]> Signed-off-by: Wang, Zheng W <[email protected]>
Cherry-pick of vllm-project@6e1be4e but adapted to recent changes in vllm-project#526 --------- Signed-off-by: Katarzyna Fojcik <[email protected]> Signed-off-by: Wang, Zheng W <[email protected]>
…llm-project#855) For `max_model_len > 32k`, Llama4 enables temperature adjustment (https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama4.py#L719). The enabled adjustment changes the shape of tensor `q` from 2D to 3D (https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama4.py#L307). This tensor is passed to `UnquantizedFusedMoEMethod.forward` (https://github.com/vllm-project/vllm-gaudi/blob/main/vllm_gaudi/ops/hpu_fused_moe.py#L163), causing invalid reshaping: we try to return a 3D `output.view` based on a 2D output tensor. The bug was introduced by vllm-project#680 and vllm-project#684. Cherry-picked from `releases/v0.13.0` --------- Signed-off-by: Artur Fierka <[email protected]> Signed-off-by: Wang, Zheng W <[email protected]>
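A hypothetical sketch of the shape handling involved, not the actual hpu_fused_moe.py code: flatten a possibly-3D input to 2D before the fused MoE computation and restore the caller's shape afterwards, so the final reshape is valid for both the 2D and 3D cases.

```python
import torch


def fused_moe_forward(moe_fn, hidden_states: torch.Tensor) -> torch.Tensor:
    """Wrap a fused MoE callable so 2D and 3D inputs are both handled."""
    orig_shape = hidden_states.shape
    # Fused MoE kernels typically expect a 2D (num_tokens, hidden_dim) input.
    x = hidden_states.reshape(-1, orig_shape[-1])
    out = moe_fn(x)                 # 2D (num_tokens, hidden_dim) output
    return out.reshape(orig_shape)  # back to the caller's 2D or 3D shape
```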
…vllm-project#852) Signed-off-by: Dudi Lester <[email protected]> Co-authored-by: Kamil Kaczor <[email protected]> Signed-off-by: Wang, Zheng W <[email protected]>
Reverts vllm-project#780 --------- Signed-off-by: Agata Dobrzyniewicz <[email protected]> Co-authored-by: Chendi.Xue <[email protected]> Signed-off-by: Wang, Zheng W <[email protected]>
Due to MambaMixer2 implementation requirements, all buckets used for mamba must be a multiple of the mamba chunk size. Signed-off-by: Jakub Byczkowski <[email protected]> Signed-off-by: Wang, Zheng W <[email protected]>
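A minimal sketch of the constraint, with a hypothetical helper name rather than the actual vllm-gaudi bucketing API: every mamba bucket is rounded up to the nearest multiple of the mamba chunk size.

```python
def align_mamba_buckets(buckets: list[int], chunk_size: int) -> list[int]:
    """Round each bucket up to the nearest multiple of chunk_size (illustrative)."""
    return sorted({((b + chunk_size - 1) // chunk_size) * chunk_size
                   for b in buckets})


# Example: with chunk_size=256, buckets [128, 512, 1000] become [256, 512, 1024].
print(align_mamba_buckets([128, 512, 1000], 256))
```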
…ject#785) Further experiments on top of vllm-project#784. I wanted to check whether we can avoid some OOMs by performing FlashAttention rescaling online rather than after computing all the parts; this should save memory on some intermediate buffers. Accuracy is surprisingly okay-ish, but I haven't tested this too thoroughly. --------- Signed-off-by: Konrad Zawora <[email protected]> Co-authored-by: Agata Dobrzyniewicz <[email protected]> Signed-off-by: Wang, Zheng W <[email protected]>
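A minimal sketch of online rescaling in a block-wise attention loop, independent of the actual HPU kernel: keep a running max and denominator and rescale the accumulated output as each KV block is folded in, so no per-block intermediate buffers need to be retained.

```python
import torch


def online_block_attention(q, k_blocks, v_blocks):
    """Block-wise attention with online softmax rescaling (illustrative only)."""
    scale = q.shape[-1] ** -0.5
    out = torch.zeros_like(q)
    running_max = torch.full(q.shape[:-1] + (1,), float("-inf"),
                             dtype=q.dtype, device=q.device)
    running_den = torch.zeros(q.shape[:-1] + (1,),
                              dtype=q.dtype, device=q.device)
    for k, v in zip(k_blocks, v_blocks):
        scores = torch.matmul(q, k.transpose(-2, -1)) * scale
        block_max = scores.amax(dim=-1, keepdim=True)
        new_max = torch.maximum(running_max, block_max)
        correction = torch.exp(running_max - new_max)
        probs = torch.exp(scores - new_max)
        # Rescale what has been accumulated so far instead of keeping
        # per-block partial outputs around and fixing them up at the end.
        out = out * correction + torch.matmul(probs, v)
        running_den = running_den * correction + probs.sum(-1, keepdim=True)
        running_max = new_max
    return out / running_den
```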
…-project#867) 1. Update the example to support prefill HND and agreed_block_size. 2. Enable prefill-side kv_layout and block_size update. Port of vllm-project/vllm#30448 to vllm-gaudi. --------- Signed-off-by: Chendi Xue <[email protected]> Signed-off-by: Yeonsil Yoon <[email protected]> Signed-off-by: Wang, Zheng W <[email protected]>
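A hypothetical sketch of the kind of layout and block-size adaptation involved, with assumed shapes and names rather than the connector's actual API: permute a paged KV cache from an NHD layout to HND, then regroup tokens into the agreed block size.

```python
import torch


def nhd_to_hnd(kv_cache: torch.Tensor) -> torch.Tensor:
    # NHD: (num_blocks, block_size, num_heads, head_dim)
    # HND: (num_blocks, num_heads, block_size, head_dim)
    return kv_cache.permute(0, 2, 1, 3).contiguous()


def regroup_block_size(kv_cache_hnd: torch.Tensor,
                       agreed_block_size: int) -> torch.Tensor:
    """Re-split an HND cache into blocks of the agreed size (illustrative)."""
    num_blocks, num_heads, block_size, head_dim = kv_cache_hnd.shape
    total_tokens = num_blocks * block_size
    assert total_tokens % agreed_block_size == 0
    # Flatten to token-major order, then regroup into agreed-size blocks.
    flat = kv_cache_hnd.permute(0, 2, 1, 3).reshape(total_tokens, num_heads,
                                                    head_dim)
    return (flat.reshape(-1, agreed_block_size, num_heads, head_dim)
                .permute(0, 2, 1, 3).contiguous())
```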
🚧 CI Blocked: The main CI workflow was not started for the following reason:
I don't think my PR breaks the CI; can anyone help here?
✅ CI Passed: All checks passed successfully against the following vllm commit:
Signed-off-by: Wang, Zheng W <[email protected]>
🚧 CI Blocked: The main CI workflow was not started for the following reason:
✅ CI Passed: All checks passed successfully against the following vllm commit:
No description provided.