
Conversation

@kevincheng2 (Collaborator) commented Nov 19, 2025

Motivation

Add prefix caching support for dynamic C8 (block-wise FP8 KV cache quantization).
Cherry-picked from #4918.

Modifications

  1. Add the cache_scales parameter required by dynamic quantization when creating and synchronizing caches in cache_manager and the worker
  2. Modify the swap_cache_batch.cu operator to support swapping cache_scales in and out (GPU <-> CPU) under multi-level caching (see the sketch after this list)
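
A minimal sketch of the paired cache/scales tensors behind these changes, assuming a GPU build of Paddle; the tensor names, shapes, and one-scale-tensor-per-cache layout are illustrative assumptions, not the actual FastDeploy interfaces:

    import paddle

    num_blocks, num_heads, block_size, head_dim = 8, 8, 64, 128

    # block_wise_fp8 stores the quantized KV cache as raw uint8 bytes...
    gpu_cache = paddle.zeros(
        [num_blocks, num_heads, block_size, head_dim], dtype="uint8"
    )
    # ...plus a separate scale tensor needed to dequantize each block.
    gpu_cache_scales = paddle.zeros(
        [num_blocks, num_heads, block_size], dtype="float32"
    )

    # CPU-side buffers mirror both tensors for multi-level caching.
    cpu_cache = gpu_cache.cpu()
    cpu_cache_scales = gpu_cache_scales.cpu()

    def swap_out(block_ids):
        """Move blocks GPU -> CPU, keeping data and scales together."""
        for b in block_ids:
            cpu_cache[b] = gpu_cache[b].cpu()
            cpu_cache_scales[b] = gpu_cache_scales[b].cpu()

The point mirrored here is that a quantized block is unusable without its scales, so every swap path has to move both tensors in lockstep.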

Usage or Command

Usage matches the launch method for dynamic C8; prefix (context) caching is enabled by default, with no extra configuration needed:

  python -m fastdeploy.entrypoints.openai.api_server \
       ...
       --quantization '{"dense_quant_type":"block_wise_fp8", "moe_quant_type":"block_wise_fp8", "quantization":"mix_quant", "kv_cache_quant_type":"block_wise_fp8"}' 
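
For reference, the value passed to --quantization is a JSON object; the snippet below simply parses the exact string from the command above and checks the field that enables quantized-KV-cache handling:

    import json

    quant_cfg = json.loads(
        '{"dense_quant_type":"block_wise_fp8", "moe_quant_type":"block_wise_fp8",'
        ' "quantization":"mix_quant", "kv_cache_quant_type":"block_wise_fp8"}'
    )
    # The KV cache quantization type drives the cache_scales machinery above.
    assert quant_cfg["kv_cache_quant_type"] == "block_wise_fp8"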

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code; run pre-commit before committing.
  • Add unit tests. If there are no unit tests, please state the reason in this PR.
  • Provide accuracy results.
  • If the current PR targets the release branch, make sure it has first been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot (bot) commented Nov 19, 2025

Thanks for your contribution!

Review thread on this snippet:

    logger.info(f"[rank {self.rank}/{self.n_ranks}] OK! Stop waiting.")

    if args.cache_dtype == "block_wise_fp8":
        cache_type = "uint8"
Collaborator commented:

To cover more dynamic-quantization cases, the cache will always be allocated as uint8 here; this could be changed to use uint8 whenever the type is not None.

Collaborator Author replied:

Couldn't this also be bfloat16? When we later extend to more dynamic-quantization scenarios, wouldn't it be enough to just adjust the check here? The workload is small, something like args.cache_dtype in ["block_wise_fp8", **].
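
A self-contained sketch of the membership check discussed above; the function name is hypothetical, and the set contents beyond "block_wise_fp8" are placeholders to be filled in as new dynamic types land:

    def resolve_cache_type(cache_dtype: str) -> str:
        """Map dynamic-quantization dtypes to the physical cache dtype."""
        dynamic_quant_dtypes = {"block_wise_fp8"}  # extend for future dynamic types
        return "uint8" if cache_dtype in dynamic_quant_dtypes else cache_dtype

    assert resolve_cache_type("block_wise_fp8") == "uint8"
    assert resolve_cache_type("bfloat16") == "bfloat16"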

rainyfly previously approved these changes Nov 20, 2025
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds support for dynamic FP8 quantization (block_wise_fp8) in prefix caching, enabling more memory-efficient KV cache storage with quantization scales. The implementation extends cache management to handle additional scale tensors required for FP8 quantization across GPU and CPU memory hierarchies.

Key Changes:

  • Extended cache infrastructure to support block-wise FP8 quantization scales in addition to quantized cache data
  • Modified cache swap operations to handle both cache data and scale tensors during GPU-CPU transfers
  • Updated configuration to detect and apply block_wise_fp8 quantization type from model config

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Reviewed files:

  • tests/cache_manager/test_cache_transfer_manager.py: added a cache_dtype parameter to the test Args class
  • fastdeploy/config.py: added logic to read kv_cache_quant_type from the model quantization config
  • fastdeploy/worker/gpu_model_runner.py: extended KV cache initialization to create and share cache-scale tensors for block_wise_fp8
  • fastdeploy/cache_manager/cache_transfer_manager.py: implemented cache-scale tensor management, including CPU/GPU allocation and swap operations
  • custom_ops/gpu_ops/swap_cache_batch.cu: updated the CUDA kernel to handle both 3-D (scales) and 4-D (cache) tensor shapes (a sketch follows this list)
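
To illustrate the 3-D vs 4-D handling described for swap_cache_batch.cu, here is a host-side Python analogue; the real operator is a CUDA kernel, so the shapes, names, and flat-copy strategy below are assumptions for illustration only:

    import numpy as np

    def block_numel(shape) -> int:
        """Elements per block for a 4-D cache or 3-D scales tensor."""
        if len(shape) == 4:  # [num_blocks, num_heads, block_size, head_dim]
            return shape[1] * shape[2] * shape[3]
        if len(shape) == 3:  # [num_blocks, num_heads, block_size] (scales)
            return shape[1] * shape[2]
        raise ValueError(f"unsupported rank {len(shape)}")

    def swap_cache_batch(src, dst, src_ids, dst_ids):
        """Copy whole blocks src[src_id] -> dst[dst_id], memcpy-style."""
        n = block_numel(src.shape)
        flat_src = src.reshape(src.shape[0], n)
        flat_dst = dst.reshape(dst.shape[0], n)
        for s, d in zip(src_ids, dst_ids):
            flat_dst[d] = flat_src[s]

Because the copy is expressed over a flattened per-block extent, the same routine serves both the 4-D quantized cache and the 3-D scale tensors.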

@codecov-commenter commented Nov 21, 2025

Codecov Report

❌ Patch coverage is 18.46154% with 53 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@6471dad). Learn more about missing BASE report.

Files with missing lines:

  • fastdeploy/cache_manager/cache_transfer_manager.py: 15.38% patch coverage (39 missing, 5 partials) ⚠️
  • fastdeploy/worker/gpu_model_runner.py: 18.18% patch coverage (7 missing, 2 partials) ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #5125   +/-   ##
==========================================
  Coverage           ?   57.76%           
==========================================
  Files              ?      317           
  Lines              ?    38370           
  Branches           ?     5742           
==========================================
  Hits               ?    22166           
  Misses             ?    14428           
  Partials           ?     1776           
Flag Coverage Δ:
  • diff: 57.76% <18.46%> (?)

Flags with carried forward coverage won't be shown.


@carryyu (Collaborator) left a comment

LGTM

@Jiang-Jia-Jun Jiang-Jia-Jun merged commit c068a4f into PaddlePaddle:develop Nov 21, 2025
21 of 27 checks passed