Add support for chunked attention (#597) #683

jkaniecki · 2025-12-04T10:55:20Z

Cherry-pick of vllm-project@6e1be4e

Signed-off-by: Jan Kaniecki <[email protected]>
Signed-off-by: Jan Kaniecki <[email protected]>
Co-authored-by: Copilot <[email protected]>

Copilot

Pull request overview

This PR adds chunked attention support to vLLM-Gaudi, cherry-picked from upstream vllm-gaudi commit 6e1be4e. Chunked attention splits the attention computation into smaller chunks, which can reduce memory use and improve performance on long sequences.
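To make the idea concrete, here is a minimal sketch of one common reading of chunked attention: a causal mask in which a query position may only attend to earlier positions inside its own fixed-size chunk. The masking rule, the function name `chunked_causal_mask`, and the parameter names are assumptions for illustration; the PR text does not spell out the exact rule, and this is not the code from the changed files.

```python
def chunked_causal_mask(seq_len: int, chunk_size: int) -> list[list[bool]]:
    """Return mask[q][k], True when query position q may attend to key position k.

    Assumed rule (illustrative only): k must not be in the future (k <= q)
    and k must fall in the same chunk as q.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        [k <= q and k // chunk_size == q // chunk_size for k in range(seq_len)]
        for q in range(seq_len)
    ]
```

With `seq_len=6` and `chunk_size=3`, position 3 starts a new chunk and sees only itself, while position 5 sees positions 3 to 5. Under this rule each query attends to at most `chunk_size` keys, which is where the potential saving on long sequences would come from.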

Key changes:

- Added chunked attention bias computation for both the prefill and decode phases
- Extended attention metadata structures to include chunked attention fields
- Integrated chunked attention configuration detection and layer setup
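The prefill and decode cases listed above can be sketched roughly as follows. This is a hypothetical illustration, not the implementation in `hpu_model_runner.py`: it assumes a chunk-local causal rule, an additive bias of `0.0` or `-inf`, and a KV cache split into fixed-size blocks. The names `chunked_prefill_bias`, `decode_visible_blocks`, `chunk_size`, and `block_size` are invented for the example.

```python
def chunked_prefill_bias(seq_len: int, chunk_size: int) -> list[list[float]]:
    """Additive attention bias for prefill: 0.0 where attention is allowed,
    -inf where it is masked (assumed chunk-local causal rule)."""
    neg_inf = float("-inf")
    return [
        [
            0.0 if (k <= q and k // chunk_size == q // chunk_size) else neg_inf
            for k in range(seq_len)
        ]
        for q in range(seq_len)
    ]


def decode_visible_blocks(position: int, chunk_size: int, block_size: int) -> list[int]:
    """Indices of KV-cache blocks that hold the current token's chunk.

    During decode only one new token is processed, so the question becomes
    which cached blocks overlap the range [chunk_start, position].
    """
    chunk_start = (position // chunk_size) * chunk_size
    first_block = chunk_start // block_size
    last_block = position // block_size
    return list(range(first_block, last_block + 1))
```

For example, with `chunk_size=8` and `block_size=4`, a token at position 15 would need blocks 2 and 3, while a token at position 9 would need only block 2. Tokens inside a partially visible block would still need masking, which this sketch leaves out.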

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| vllm_gaudi/v1/worker/hpu_model_runner.py | Core implementation of chunked attention including bias computation, block mapping, metadata updates, and model initialization logic |
| vllm_gaudi/v1/attention/backends/hpu_attn.py | Updated decode metadata factory method to accept chunked attention parameters |
| vllm_gaudi/attention/backends/hpu_attn.py | Added chunked attention metadata fields and logic to select appropriate attention blocks during forward pass |
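As a rough picture of the metadata and block-selection change described for the attention backend, the sketch below carries an optional chunked block list next to the regular one and picks between them per layer. Every name here (`DecodeMetadataSketch`, `block_list`, `chunked_block_list`, `select_block_list`) is hypothetical and chosen for illustration; the actual fields and selection logic in `hpu_attn.py` may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DecodeMetadataSketch:
    """Hypothetical decode metadata: regular fields plus optional chunked ones."""
    block_list: list[int]
    chunked_block_list: Optional[list[int]] = None


def select_block_list(meta: DecodeMetadataSketch, layer_is_chunked: bool) -> list[int]:
    """Use the chunked block list only for layers that use chunked attention
    and only when the metadata actually provides one."""
    if layer_is_chunked and meta.chunked_block_list is not None:
        return meta.chunked_block_list
    return meta.block_list
```

Keeping the chunked fields optional means layers with ordinary attention, and models that never enable chunked attention, continue to use the regular block list unchanged.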



github-actions · 2025-12-04T17:27:06Z

✅ CI Passed

All checks passed successfully against the following vllm commit:
1b7c7f5159484063af28cb47809d79e83d3301ec

Add support for chunked attention (vllm-project#597)

fabf8fa


Copilot AI review requested due to automatic review settings December 4, 2025 10:55

jkaniecki requested review from adobrzyn, afierka-intel, iboiko-habana, kamil-kaczor, ksmusz, kzawora-intel, mgawarkiewicz-intel, michalkuligowski and xuechendi as code owners December 4, 2025 10:55

Copilot AI reviewed Dec 4, 2025


Update hpu_attn.py

ffdb483
