Conversation

@testdig testdig commented Jan 21, 2026

No description provided.

@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
6218034dd7f9a56596e4fd8c8c8fc1d8011ed9c2

@@ -0,0 +1,12 @@
from vllm.config import VllmConfig
Collaborator

Copyright header missing
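
For illustration, a hedged example of the kind of license header typically expected at the top of a new Python file in vLLM repositories; the exact wording should be copied from the repository's existing files rather than from this sketch:

```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
from vllm.config import VllmConfig
```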

@testdig testdig Jan 29, 2026

updated, thanks for the review

@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

testdig and others added 14 commits January 29, 2026 09:34
Signed-off-by: Wang, Zheng W <[email protected]>
Adds a Qwen3 model test case for image input

Signed-off-by: slokesha <[email protected]>
Co-authored-by: Iryna Boiko <[email protected]>
Signed-off-by: Wang, Zheng W <[email protected]>
Following the reasoning stated in PR:
vllm-project#616

Signed-off-by: Radoslaw Smyrek <[email protected]>
Signed-off-by: Wang, Zheng W <[email protected]>
…ect#837)

Signed-off-by: linoy buchnik <[email protected]>
Signed-off-by: Iryna Boiko <[email protected]>
Co-authored-by: Iryna Boiko <[email protected]>
Signed-off-by: Wang, Zheng W <[email protected]>
Adds support for cross-layer KV cache sharing on HPU, enabling models
like Gemma-3n that share KV cache between layers to run on Gaudi.

**Changes**
- hpu_attn.py: Store kv_sharing_target_layer_name and skip KV cache writes for sharing layers (sketched below)
- hpu_model_runner.py: Track shared layers, validate config, and set up
tensor sharing during initialization
- test_hpu_model_runner.py: Enable KV sharing unit tests

**Expected Benefits**
- Reduced KV cache memory usage for models with layer sharing
- Lower TTFT for long-context scenarios in supported models (e.g., Gemma-3n)

**Testing**
- Unit tests pass
- E2E validation with a KV-sharing model (e.g., Gemma-3n) pending
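
A minimal, hypothetical sketch of the skip-write behavior described in the changes above; this is not the actual hpu_attn.py code, and the helper name, cache layout, and `slots` indexing are assumptions:

```python
import torch

def maybe_write_kv(layer_name, kv_sharing_target_layer_name, key, value,
                   kv_caches, slots):
    """Return the (k_cache, v_cache) pair this layer should attend over.

    A sharing layer skips its own cache write and reuses the target layer's
    cache, which the model runner aliased during initialization.
    """
    if kv_sharing_target_layer_name is not None:
        # Sharing layer (e.g. in Gemma-3n): no write, attend over the shared cache.
        return kv_caches[kv_sharing_target_layer_name]
    k_cache, v_cache = kv_caches[layer_name]
    # Normal layer: write this step's K/V into its own cache slots.
    k_cache.index_copy_(0, slots, key)
    v_cache.index_copy_(0, slots, value)
    return k_cache, v_cache
```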

---------

Signed-off-by: jakub-sochacki <[email protected]>
Co-authored-by: jakub-sochacki <[email protected]>
Signed-off-by: Wang, Zheng W <[email protected]>
Signed-off-by: Iryna Boiko <[email protected]>
Co-authored-by: Chendi.Xue <[email protected]>
Signed-off-by: Wang, Zheng W <[email protected]>
## Motivation
Qwen2.5-VL models have lower accuracy than expected, and this accuracy regressed due to PR vllm-project#698 (commit 18105cc on main). This PR introduces two changes that boost accuracy of Qwen2.5-VL-7B-Instruct on the MMMU dataset from ~42% to 51%. The accuracy matches that seen with the GPU version of vLLM (build 0.13.0) under similar test conditions.

## Changes
- The first change fixes the regression: the attn_mask was not being applied in HPUQwen2_5_VisionBlock.
- The second change enables fp32_softmax for qwen2_5_vl models (both are sketched below).
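
A hedged sketch of both changes with assumed shapes; this is not the actual HPUQwen2_5_VisionBlock code:

```python
import torch

def vision_attention(q, k, v, attn_mask=None, fp32_softmax=True):
    # q, k, v: [batch, heads, seq, head_dim]; attn_mask is broadcastable to the scores.
    scores = torch.matmul(q, k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    if attn_mask is not None:
        scores = scores + attn_mask  # the step that was missing after the regression
    if fp32_softmax:
        # Compute the softmax in fp32 for accuracy, then cast back to the activation dtype.
        probs = torch.softmax(scores.float(), dim=-1).to(q.dtype)
    else:
        probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v)
```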

---------

Signed-off-by: Tanner Voas <[email protected]>
Signed-off-by: Wang, Zheng W <[email protected]>
Signed-off-by: Xinyu Chen <[email protected]>
Co-authored-by: Yaser Afshar <[email protected]>
Co-authored-by: Chendi.Xue <[email protected]>
Signed-off-by: Wang, Zheng W <[email protected]>
Signed-off-by: Milosz Grunwald <[email protected]>
Signed-off-by: Wang, Zheng W <[email protected]>
Cherry-pick of

vllm-project@6e1be4e
but adapted to recent changes in
vllm-project#526

---------

Signed-off-by: Katarzyna Fojcik <[email protected]>
Signed-off-by: Wang, Zheng W <[email protected]>
…llm-project#855)

For `max_model_len > 32k`, Llama4 enables temperature adjustment:
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama4.py#L719.
The enabled adjustment changes the shape of tensor `q` from 2D to 3D:
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama4.py#L307.
This tensor is passed to `UnquantizedFusedMoEMethod -> forward`:
https://github.com/vllm-project/vllm-gaudi/blob/main/vllm_gaudi/ops/hpu_fused_moe.py#L163
causing invalid reshaping: we try to return a 3D `output.view` based on a 2D output tensor (see the sketch below).

The bug was introduced by the following PRs: vllm-project#680 and vllm-project#684

Cherry-picked from `releases/v0.13.0`
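
A hedged sketch of one way to handle the shape mismatch (not the actual hpu_fused_moe.py code; `experts_fn` is a hypothetical stand-in for the fused expert computation):

```python
import torch

def moe_forward_shape_safe(hidden_states: torch.Tensor, experts_fn) -> torch.Tensor:
    orig_shape = hidden_states.shape
    # Collapse any leading dims (3D when temperature adjustment is enabled) to 2D.
    flat = hidden_states.reshape(-1, orig_shape[-1])
    output = experts_fn(flat)          # expert computation sees [tokens, hidden]
    # Restore the caller's original shape instead of view()-ing with stale dims.
    return output.reshape(orig_shape)
```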

---------

Signed-off-by: Artur Fierka <[email protected]>
Signed-off-by: Wang, Zheng W <[email protected]>
Reverts vllm-project#780

---------

Signed-off-by: Agata Dobrzyniewicz <[email protected]>
Co-authored-by: Chendi.Xue <[email protected]>
Signed-off-by: Wang, Zheng W <[email protected]>
Due to MambaMixer2 implementation requirements, all buckets used for mamba must be a multiple of the mamba chunk size.
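
A hedged illustration of the constraint; the helper name is hypothetical, not the actual bucketing code:

```python
def align_to_chunk(bucket: int, chunk_size: int) -> int:
    """Round a bucket size up to the nearest multiple of the mamba chunk size."""
    return ((bucket + chunk_size - 1) // chunk_size) * chunk_size

# e.g. with chunk_size=256: align_to_chunk(300, 256) == 512, align_to_chunk(512, 256) == 512
```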

Signed-off-by: Jakub Byczkowski <[email protected]>
Signed-off-by: Wang, Zheng W <[email protected]>
kzawora-intel and others added 2 commits January 29, 2026 09:34
…ject#785)

Further experiments on top of vllm-project#784: I wanted to check whether we can avoid some OOMs by performing FlashAttention rescaling online rather than after computing all the parts, which should save memory on some intermediate buffers. Accuracy is surprisingly okay-ish, but I haven't tested this too thoroughly.
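
A hedged, simplified sketch of online rescaling in plain PyTorch (not the actual HPU kernel): partial outputs are rescaled as each key/value block is processed, so unnormalized intermediate parts never have to be kept around all at once:

```python
import torch

def attention_online(q, k, v, block_size=128):
    # q: [num_q, d]; k, v: [num_k, d]; computed in fp32 for simplicity.
    scale = q.shape[-1] ** -0.5
    m = torch.full((q.shape[0], 1), float("-inf"))   # running row max
    l = torch.zeros(q.shape[0], 1)                   # running softmax denominator
    acc = torch.zeros(q.shape[0], v.shape[-1])       # running weighted sum
    for start in range(0, k.shape[0], block_size):
        kb = k[start:start + block_size].float()
        vb = v[start:start + block_size].float()
        s = (q.float() @ kb.T) * scale
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        alpha = torch.exp(m - m_new)                 # rescale previous partials
        p = torch.exp(s - m_new)
        l = l * alpha + p.sum(dim=-1, keepdim=True)
        acc = acc * alpha + p @ vb
        m = m_new
    return (acc / l).to(q.dtype)
```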

---------

Signed-off-by: Konrad Zawora <[email protected]>
Co-authored-by: Agata Dobrzyniewicz <[email protected]>
Signed-off-by: Wang, Zheng W <[email protected]>
…-project#867)

1. Update the example to support prefill HND and agreed_block_size
2. Enable prefill-side kv_layout and block_size updates

Port vllm-project/vllm#30448 to vllm-gaudi

---------

Signed-off-by: Chendi Xue <[email protected]>
Signed-off-by: Yeonsil Yoon <[email protected]>
Signed-off-by: Wang, Zheng W <[email protected]>
@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.


testdig commented Jan 29, 2026

I don't think my PR breaks the CI; can anyone help here?

@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
6218034dd7f9a56596e4fd8c8c8fc1d8011ed9c2

Signed-off-by: Wang, Zheng W <[email protected]>
@github-actions

github-actions bot commented Feb 2, 2026

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

@github-actions

github-actions bot commented Feb 4, 2026

✅ CI Passed

All checks passed successfully against the following vllm commit:
17b17c068453e6dc6af79240bb94857ae175cc51
