[Feature] Support GQA SWA attention and v_head_dim KV cache#8041

Open
chang-wenbin wants to merge 16 commits into PaddlePaddle:develop from chang-wenbin:ALL-GQA_SWA

@chang-wenbin

Collaborator

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@codecov-commenter

codecov-commenter commented Jun 12, 2026

Codecov Report

❌ Patch coverage is 66.87117% with 54 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@ecd9733). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| fastdeploy/model_executor/layers/linear.py | 54.09% | 26 Missing and 2 partials ⚠️ |
| ...l_executor/layers/attention/append_attn_backend.py | 52.38% | 7 Missing and 3 partials ⚠️ |
| fastdeploy/cache_manager/v1/cache_controller.py | 76.66% | 6 Missing and 1 partial ⚠️ |
| fastdeploy/worker/gpu_model_runner.py | 84.84% | 4 Missing and 1 partial ⚠️ |
| ...eploy/model_executor/layers/attention/attention.py | 77.77% | 1 Missing and 1 partial ⚠️ |
| fastdeploy/config.py | 50.00% | 0 Missing and 1 partial ⚠️ |
| ...astdeploy/model_executor/ops/triton_ops/do_rope.py | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #8041   +/-   ##
==========================================
  Coverage           ?   67.75%           
==========================================
  Files              ?      475           
  Lines              ?    66694           
  Branches           ?    10284           
==========================================
  Hits               ?    45189           
  Misses             ?    18613           
  Partials           ?     2892           

| Flag | Coverage Δ |
|---|---|
| GPU | 77.76% <66.87%> (?) |
| XPU | 6.97% <1.22%> (?) |

PaddlePaddle-bot

This comment was marked as outdated.

zhoutianzi666 previously approved these changes on Jun 12, 2026
chang-wenbin changed the title from "All gqa swa" to "[Feature] Support GQA SWA attention and v_head_dim KV cache" on Jun 12, 2026

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-06-13 01:00:28

📋 Review Summary

PR overview: Adds support for GQA/SWA attention, per-layer KV-head attention backends, and v_head_dim-aware KV cache and QKV loader paths.
Scope of changes: config.py, attention backend/op, KV cache manager, GPU model runner, QKV/QKVG linear loader, PaddleFormers config sync, and related tests.
Affected tags: [FDConfig] [OP] [KVCache] [Models] [Loader]
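
To make the "per-layer KV heads" and "v_head_dim-aware KV cache" parts of the overview concrete, here is a minimal sketch of how key and value cache shapes could diverge per layer. The function name `kv_cache_shapes` and the `(num_blocks, kv_heads, block_size, dim)` layout are assumptions chosen for illustration; they are not taken from the PR's code.

```python
def kv_cache_shapes(num_blocks, block_size, kv_heads_per_layer, head_dim, v_head_dim=None):
    """Return one (key_shape, value_shape) pair per layer.

    Hypothetical helper: the 4-D layout used here is an assumed paged-cache
    layout, not the layout confirmed by the PR.
    """
    # Fall back to head_dim when the model does not define a separate V width.
    v_dim = head_dim if v_head_dim is None else v_head_dim
    shapes = []
    for kv_heads in kv_heads_per_layer:
        key_shape = (num_blocks, kv_heads, block_size, head_dim)
        value_shape = (num_blocks, kv_heads, block_size, v_dim)
        shapes.append((key_shape, value_shape))
    return shapes


# Two layers with different KV head counts and a narrower V head.
shapes = kv_cache_shapes(
    num_blocks=4, block_size=64, kv_heads_per_layer=[8, 2], head_dim=128, v_head_dim=64
)
for layer_idx, (k_shape, v_shape) in enumerate(shapes):
    print(layer_idx, k_shape, v_shape)
```

The point of the sketch is that a single global cache shape no longer suffices once either the KV head count varies by layer or V uses a different per-head width than K.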

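Issues

| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/model_executor/layers/linear.py:799 | The non-fused Q/K/V loading path for shared KV still slices the V weight by head_dim; when v_head_dim != head_dim, the slice is wrong or the reshape fails. |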
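
The bug reported above comes down to how the V slice is sized. The following is a minimal sketch, not the code in linear.py: `qkv_split_ranges` is a hypothetical helper that computes the Q, K, and V ranges along the fused output dimension, assuming the fused weight is laid out as Q, then K, then V.

```python
def qkv_split_ranges(num_heads, num_kv_heads, head_dim, v_head_dim=None):
    """Return {"q" | "k" | "v": (offset, size)} along the fused output dimension.

    Hypothetical helper; the Q-then-K-then-V layout is an assumption.
    """
    v_dim = head_dim if v_head_dim is None else v_head_dim
    q_size = num_heads * head_dim
    k_size = num_kv_heads * head_dim
    v_size = num_kv_heads * v_dim  # V must be sized by v_head_dim, not head_dim
    return {
        "q": (0, q_size),
        "k": (q_size, k_size),
        "v": (q_size + k_size, v_size),
    }


# Correct ranges for a model whose V heads are narrower than its Q/K heads.
correct = qkv_split_ranges(num_heads=8, num_kv_heads=2, head_dim=128, v_head_dim=64)
fused_width = sum(size for _, size in correct.values())

# Sizing V by head_dim instead reads past the end of the fused weight.
naive = qkv_split_ranges(num_heads=8, num_kv_heads=2, head_dim=128)
naive_v_end = naive["v"][0] + naive["v"][1]
print(correct["v"], fused_width, naive_v_end)
```

With these numbers the head_dim-based V slice ends beyond the real fused width, which matches the "wrong slice or failed reshape" symptom described in the finding.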

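Status of previous findings

| Finding | Issue | Status |
|---|---|---|
| F1 | The TP-sharded loading path of QKVGateParallelLinear calls a nonexistent _get_kv_shard_id() | ⚠️ Still present |
| F2 | The sliding attention configuration is bypassed by the current window_attn_skip_freq check | ⚠️ Still present |
| F3 | num_key_value_heads_list is not passed to the QKV projection layer | ⚠️ Still present |
| F4 | In QKVGateParallelLinear.qkv_weight_loader(), the param shard offset/size was not scaled accordingly when is_scale=True | ✅ Fixed |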
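
For context on F1, the sketch below shows the mapping that a KV shard id helper commonly implements in tensor-parallel GQA loaders: when there are fewer KV heads than TP ranks, several ranks share (replicate) the same KV head. The function `kv_shard_id` is hypothetical and written for illustration; it is not the missing `_get_kv_shard_id()` from the PR, whose intended behavior the review does not describe.

```python
def kv_shard_id(tp_rank, tp_size, num_kv_heads):
    """Map a tensor-parallel rank to the KV shard it should load.

    Hypothetical helper illustrating the usual GQA convention, assuming
    even divisibility between num_kv_heads and tp_size.
    """
    if not 0 <= tp_rank < tp_size:
        raise ValueError("tp_rank must be in [0, tp_size)")
    if num_kv_heads >= tp_size:
        # Enough KV heads: each rank owns its own contiguous shard.
        if num_kv_heads % tp_size != 0:
            raise ValueError("num_kv_heads must be divisible by tp_size")
        return tp_rank
    # Fewer KV heads than ranks: consecutive ranks replicate one KV head.
    if tp_size % num_kv_heads != 0:
        raise ValueError("tp_size must be divisible by num_kv_heads")
    replicas = tp_size // num_kv_heads
    return tp_rank // replicas


# 2 KV heads spread over 8 ranks: ranks 0-3 load head 0, ranks 4-7 load head 1.
shard_ids = [kv_shard_id(rank, tp_size=8, num_kv_heads=2) for rank in range(8)]
print(shard_ids)
```

If per-layer KV head counts are in play (see F3), a helper like this would need the layer's own `num_kv_heads` rather than a single global value.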

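📝 PR convention check

The title now carries the official [Feature] tag; the PR description still contains the template placeholder text, and the checklist has not been ticked to reflect the actual state.

Suggested title (copy as-is):

  • [OP] Support GQA SWA attention and v_head_dim KV cache
Suggested PR description (copy as-is):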
## Motivation
Support GQA/SWA attention variants, per-layer KV head counts, and `v_head_dim` aware KV cache/weight loading paths.

## Modifications
- Add group size 3 append attention template dispatch coverage.
- Support per-layer attention backends and per-layer KV cache shapes.
- Add `v_head_dim` propagation to config, PaddleFormers sync, attention cache shape, RoPE/cache write path, and QKV/QKVG linear loading.
- Adjust append attention SWA handling and related tests.

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

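Overall assessment

This round of review was prioritized by risk and covered the key attention, KV cache, and loader paths. It confirmed that a recent commit fixed a previously reported quantization scale offset issue. However, the new v_head_dim semantics are still clearly wrong in the shared-KV branch of the non-fused Q/K/V loader, and this must be fixed before merging.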

4 participants