[Feature] Support GQA SWA attention and v_head_dim KV cache by chang-wenbin · Pull Request #8041 · PaddlePaddle/FastDeploy

chang-wenbin · 2026-06-12T03:19:37Z

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

codecov-commenter · 2026-06-12T03:54:38Z

Codecov Report

❌ Patch coverage is 66.87117% with 54 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@ecd9733). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
fastdeploy/model_executor/layers/linear.py	54.09%	26 Missing and 2 partials ⚠️
...l_executor/layers/attention/append_attn_backend.py	52.38%	7 Missing and 3 partials ⚠️
fastdeploy/cache_manager/v1/cache_controller.py	76.66%	6 Missing and 1 partial ⚠️
fastdeploy/worker/gpu_model_runner.py	84.84%	4 Missing and 1 partial ⚠️
...eploy/model_executor/layers/attention/attention.py	77.77%	1 Missing and 1 partial ⚠️
fastdeploy/config.py	50.00%	0 Missing and 1 partial ⚠️
...astdeploy/model_executor/ops/triton_ops/do_rope.py	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #8041   +/-   ##
==========================================
  Coverage           ?   67.75%           
==========================================
  Files              ?      475           
  Lines              ?    66694           
  Branches           ?    10284           
==========================================
  Hits               ?    45189           
  Misses             ?    18613           
  Partials           ?     2892
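The figures in this report can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check on the numbers quoted above, not part of the PR or of Codecov's tooling. File names are shortened to basenames, and the helper `truncate_pct` is a hypothetical name. Truncating rather than rounding to two decimals is an assumption that happens to reproduce the reported 67.75%.

```python
import math

# Project totals copied from the coverage diff above.
hits, misses, partials, lines = 45189, 18613, 2892, 66694

# (missing, partial) counts per file, copied from the "Files with missing lines" table.
uncovered_by_file = {
    "linear.py": (26, 2),
    "append_attn_backend.py": (7, 3),
    "cache_controller.py": (6, 1),
    "gpu_model_runner.py": (4, 1),
    "attention.py": (1, 1),
    "config.py": (0, 1),
    "do_rope.py": (1, 0),
}


def truncate_pct(numerator, denominator, digits=2):
    """Return numerator/denominator as a percentage, truncated (not rounded)."""
    scale = 10 ** digits
    return math.floor(numerator / denominator * 100 * scale) / scale


# Missing plus partial lines across all files should match the "54 lines" headline.
total_uncovered = sum(missing + partial for missing, partial in uncovered_by_file.values())

# Hits over total lines should match the reported project coverage.
project_pct = truncate_pct(hits, lines)

print(total_uncovered, project_pct)
```

The per-file counts add up to the 54 uncovered lines in the headline, and hits, misses, and partials add up to the reported line total.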

Flag	Coverage Δ
GPU	`77.76% <66.87%> (?)`
XPU	`6.97% <1.22%> (?)`


PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-06-13 01:00:28

📋 Review Summary

PR overview: Adds support for GQA/SWA attention, per-layer KV-head attention backends, and `v_head_dim`-aware KV cache and QKV loader paths.
Scope of changes: config.py, attention backends/ops, the KV cache manager, the GPU model runner, the QKV/QKVG linear loaders, PaddleFormers config sync, and related tests.
Affected tags: [FDConfig] [OP] [KVCache] [Models] [Loader]

Issues

Severity	File	Summary
🔴 Bug	`fastdeploy/model_executor/layers/linear.py:799`	The non-fused Q/K/V load path for shared KV still slices the V weight by `head_dim`; when `v_head_dim != head_dim`, the slice is wrong or the reshape fails
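
To illustrate the class of bug described in this finding, here is a minimal NumPy sketch of splitting a fused QKV weight when the value head dimension differs from the query/key head dimension. It is not FastDeploy's loader: `split_qkv`, its parameters, and the column-wise `[Q | K | V]` layout are assumptions made for illustration. The sizes 192 and 128 are taken from the commit message "support gqa qkdim=192 vdim=128".

```python
import numpy as np


def split_qkv(weight, num_heads, num_kv_heads, head_dim, v_head_dim=None):
    """Split a fused [hidden, Q | K | V] weight into its three column blocks.

    Hypothetical helper: the V block is sized by v_head_dim, not head_dim.
    """
    v_head_dim = head_dim if v_head_dim is None else v_head_dim
    q_size = num_heads * head_dim
    k_size = num_kv_heads * head_dim
    v_size = num_kv_heads * v_head_dim
    expected = q_size + k_size + v_size
    if weight.shape[-1] != expected:
        raise ValueError(f"expected {expected} fused columns, got {weight.shape[-1]}")
    # Split points are cumulative column offsets of the Q and K blocks.
    q, k, v = np.split(weight, [q_size, q_size + k_size], axis=-1)
    return q, k, v


hidden, num_heads, num_kv_heads = 64, 8, 4
head_dim, v_head_dim = 192, 128
fused = np.zeros((hidden, num_heads * head_dim + num_kv_heads * (head_dim + v_head_dim)))

q, k, v = split_qkv(fused, num_heads, num_kv_heads, head_dim, v_head_dim)
# Reshaping V per head only works with v_head_dim as the last dimension.
v_per_head = v.reshape(hidden, num_kv_heads, v_head_dim)
```

Calling `split_qkv` on the same weight without `v_head_dim` sizes the V block by `head_dim` and raises a `ValueError` in this sketch, which mirrors the "wrong slice or failed reshape" symptom the review describes.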

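Status of Previous Findings

Finding	Issue	Status
F1	TP-sharded loading in `QKVGateParallelLinear` calls `_get_kv_shard_id()`, which does not exist	⚠️ Still present
F2	The sliding attention configuration is bypassed by the current `window_attn_skip_freq` check	⚠️ Still present
F3	`num_key_value_heads_list` is not passed to the QKV projection layer	⚠️ Still present
F4	With `is_scale=True`, the param shard offset/size in `QKVGateParallelLinear.qkv_weight_loader()` was not scaled accordingly	✅ Fixed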

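📝 PR Convention Check

The title now carries the official [Feature] tag; the PR description still contains template placeholder content, and the Checklist has not been ticked to reflect the actual state.

Suggested title (copy as-is):

[OP] Support GQA SWA attention and v_head_dim KV cache

Suggested PR description (click to expand, copy as-is)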

## Motivation
Support GQA/SWA attention variants, per-layer KV head counts, and `v_head_dim` aware KV cache/weight loading paths.

## Modifications
- Add group size 3 append attention template dispatch coverage.
- Support per-layer attention backends and per-layer KV cache shapes.
- Add `v_head_dim` propagation to config, PaddleFormers sync, attention cache shape, RoPE/cache write path, and QKV/QKVG linear loading.
- Adjust append attention SWA handling and related tests.

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
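
The "per-layer KV cache shapes" and "`v_head_dim` propagation" items in the suggested Modifications list can be pictured with a small sketch. Everything here is an assumption for illustration: the `[num_blocks, num_kv_heads, block_size, dim]` layout, the names `LayerCacheSpec`, `kv_cache_shapes`, and `gqa_group_size`, and the 24 query heads are hypothetical and not taken from FastDeploy. The KV head counts 8 and 4 and the dimensions 192 and 128 echo the commit messages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerCacheSpec:
    """Per-layer attention geometry (hypothetical container)."""
    num_kv_heads: int
    head_dim: int    # query/key head dimension
    v_head_dim: int  # value head dimension, may differ from head_dim


def kv_cache_shapes(spec, num_blocks, block_size):
    """Return (key_shape, value_shape) for one layer under the assumed layout."""
    key_shape = (num_blocks, spec.num_kv_heads, block_size, spec.head_dim)
    value_shape = (num_blocks, spec.num_kv_heads, block_size, spec.v_head_dim)
    return key_shape, value_shape


def gqa_group_size(num_heads, num_kv_heads):
    """Number of query heads that share one KV head in grouped-query attention."""
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be divisible by num_kv_heads")
    return num_heads // num_kv_heads


# Two layers with different KV head counts, as a per-layer configuration would allow.
layers = [LayerCacheSpec(8, 192, 128), LayerCacheSpec(4, 192, 128)]
shapes = [kv_cache_shapes(spec, num_blocks=2, block_size=64) for spec in layers]
groups = [gqa_group_size(24, spec.num_kv_heads) for spec in layers]
```

With these illustrative numbers the first layer has a GQA group size of 3, and the key and value caches differ only in their last dimension.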

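Overall Assessment

This round prioritized the highest-risk paths (attention, KV cache, loader) and confirmed that the latest commits fix a previously reported quantization scale offset issue. However, the new `v_head_dim` semantics are still clearly wrong in the shared-KV branch of the non-fused Q/K/V loader, and this must be fixed before merging.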

chang-wenbin added 12 commits June 8, 2026 20:46

support qkdim!=vdim

44f547c

Merge remote-tracking branch 'origin/develop' into qkdim_vdim

fc06ae2

Merge remote-tracking branch 'origin/develop' into qkdim_vdim

19a7044

support gqa qkdim=192 vdim=128

2958fda

support qkdim!=vdim

303ad42

merge develop

29dce63

update v_head_dim

db7d260

fix & update

94fa1f9

diff_kv_head

b0a901e

update kvhead8-4

d7bda46

update kvhead8-4

ec0be7d

update gqa_swa

3e07456

chang-wenbin had a problem deploying to Metax_ci June 12, 2026 03:19 — with GitHub Actions Failure

fix bug

c1124b4

chang-wenbin had a problem deploying to Metax_ci June 12, 2026 06:46 — with GitHub Actions Failure

fix bug

779eac0

chang-wenbin had a problem deploying to Metax_ci June 12, 2026 07:00 — with GitHub Actions Failure

merge develop

a19e8a5

chang-wenbin had a problem deploying to Metax_ci June 12, 2026 07:32 — with GitHub Actions Failure

chang-wenbin requested a review from PaddlePaddle-bot June 12, 2026 07:37

zhoutianzi666 previously approved these changes Jun 12, 2026

chang-wenbin changed the title ~~All gqa swa~~ [Feature] Support GQA SWA attention and v_head_dim KV cache Jun 12, 2026

fix some bug

7e235b7

chang-wenbin dismissed zhoutianzi666’s stale review via 7e235b7 June 12, 2026 16:03

chang-wenbin had a problem deploying to Metax_ci June 12, 2026 16:03 — with GitHub Actions Failure

chang-wenbin requested a review from PaddlePaddle-bot June 12, 2026 16:28

PaddlePaddle-bot suggested changes Jun 12, 2026

chang-wenbin wants to merge 16 commits into PaddlePaddle:develop from chang-wenbin:ALL-GQA_SWA

4 participants
