[Feature] dyc8 support prefixcache #5125
Conversation
Thanks for your contribution!
Reviewed snippet:

```python
if args.cache_dtype == "block_wise_fp8":
    cache_type = "uint8"
```
Considering more dynamic-quantization cases, the cache would be allocated as uint8 in all of them; this could be changed to use uint8 whenever `cache_dtype` is not None.
Couldn't this also be bfloat16? When more dynamic-quantization scenarios are added later, wouldn't it be enough to just extend the condition here, something like `args.cache_dtype in ["block_wise_fp8", **]`? The work involved is small.
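A minimal sketch of that suggestion (the function name and the dtype set are hypothetical, not the PR's actual code):

```python
# Sketch of the reviewers' suggestion: allocate the KV cache as uint8 for
# any dynamic-quantization dtype rather than special-casing block_wise_fp8.
# DYNAMIC_QUANT_DTYPES is illustrative; the project may gate this differently.
DYNAMIC_QUANT_DTYPES = {"block_wise_fp8"}  # extend as new schemes land


def resolve_cache_type(cache_dtype):
    """Return the storage dtype used when allocating the KV cache."""
    if cache_dtype in DYNAMIC_QUANT_DTYPES:
        # Quantized caches store raw bytes; scales live in separate tensors.
        return "uint8"
    # No dynamic quantization configured: keep the model's KV dtype,
    # e.g. "bfloat16".
    return cache_dtype or "bfloat16"
```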
Pull Request Overview
This PR adds support for dynamic FP8 quantization (block_wise_fp8) in prefix caching, enabling more memory-efficient KV cache storage with quantization scales. The implementation extends cache management to handle additional scale tensors required for FP8 quantization across GPU and CPU memory hierarchies.
Key Changes:
- Extended cache infrastructure to support block-wise FP8 quantization scales in addition to quantized cache data
- Modified cache swap operations to handle both cache data and scale tensors during GPU-CPU transfers
- Updated configuration to detect and apply block_wise_fp8 quantization type from model config
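As a mental model (not the PR's actual kernels), the sketch below shows block-wise quantization producing exactly the two artifacts the list above mentions: a uint8 cache tensor plus one float scale per block. int8 stands in for FP8, and all names are illustrative.

```python
import numpy as np


def quantize_block_wise(kv: np.ndarray, block_size: int):
    """Quantize a [num_tokens, head_dim] KV slab block by block.

    Returns the quantized cache bytes (stored as uint8) plus one scale
    per block -- the extra scale tensor this PR threads through the
    cache manager. int8 stands in for FP8 here; a real block_wise_fp8
    path would scale to the float8_e4m3 range instead.
    """
    num_blocks = kv.shape[0] // block_size
    blocks = kv[: num_blocks * block_size].reshape(num_blocks, block_size, -1)
    scales = np.abs(blocks).max(axis=(1, 2)) / 127.0  # one scale per block
    scales = np.maximum(scales, 1e-12)                # avoid division by zero
    quantized = np.rint(blocks / scales[:, None, None]).astype(np.int8)
    # Dequantize with: quantized.astype(np.float32) * scales[:, None, None]
    return quantized.view(np.uint8), scales.astype(np.float32)
```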
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/cache_manager/test_cache_transfer_manager.py | Added cache_dtype parameter to test Args class |
| fastdeploy/config.py | Added logic to read kv_cache_quant_type from model quantization config |
| fastdeploy/worker/gpu_model_runner.py | Extended KV cache initialization to create and share cache scale tensors for block_wise_fp8 |
| fastdeploy/cache_manager/cache_transfer_manager.py | Implemented cache scale tensor management including CPU/GPU allocation and swap operations |
| custom_ops/gpu_ops/swap_cache_batch.cu | Updated CUDA kernel to handle both 3D (scales) and 4D (cache) tensor shapes |
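The last row is where the scale tensors meet the data path: the batch-swap operation must now move both 4-D cache blocks and 3-D scale blocks. Below is a rough host-side Python sketch of that rank-based dispatch; the shapes and function names are assumptions for illustration, and the real change lives in the CUDA kernel in `custom_ops/gpu_ops/swap_cache_batch.cu`.

```python
def bytes_per_block(shape, itemsize):
    """Bytes in one transferable cache block, for either tensor rank.

    Assumed layouts (hypothetical): 4-D quantized cache
    [num_blocks, num_heads, block_size, head_dim] and 3-D scales
    [num_blocks, num_heads, block_size].
    """
    if len(shape) == 4:  # quantized cache data
        _, num_heads, block_size, head_dim = shape
        return num_heads * block_size * head_dim * itemsize
    if len(shape) == 3:  # per-block quantization scales
        _, num_heads, block_size = shape
        return num_heads * block_size * itemsize
    raise ValueError(f"unsupported cache tensor rank: {len(shape)}")


def swap_blocks(gpu_tensor, cpu_tensor, block_ids, direction):
    """Copy the listed blocks between GPU and CPU, cache or scales alike."""
    for block_id in block_ids:
        if direction == "gpu_to_cpu":
            cpu_tensor[block_id] = gpu_tensor[block_id]  # offload
        else:
            gpu_tensor[block_id] = cpu_tensor[block_id]  # reload
```

Note that host-side indexing by the leading block dimension is rank-agnostic; it is the raw CUDA kernel, which works on explicit strides, that needed the 3-D/4-D distinction.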
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@           Coverage Diff            @@
##           develop    #5125   +/-   ##
==========================================
  Coverage         ?   57.76%
==========================================
  Files            ?      317
  Lines            ?    38370
  Branches         ?     5742
==========================================
  Hits             ?    22166
  Misses           ?    14428
  Partials         ?     1776
```

Flags with carried forward coverage won't be shown.
carryyu left a comment:
LGTM
Motivation
Support prefix caching for the dynamic C8 (block_wise_fp8) quantized KV cache. Cherry-picked from #4918.
Modifications
Usage or Command
Usage matches the standard dynamic C8 launch procedure; prefix (context) caching is enabled by default, so no extra configuration is required.
Accuracy Tests
Checklist
- Add at least one of the following tags to the PR title: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- For the `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.