[Feature] dyc8 support prefixcache #5125
Conversation
Thanks for your contribution!
Reviewed snippet:

```python
if args.cache_dtype == "block_wise_fp8":
    cache_type = "uint8"
```
Considering more dynamic-quantization cases, the cache would be allocated as uint8 in all of them; this could be changed to use uint8 whenever `cache_dtype` is not None.
Couldn't this also be bfloat16? When more dynamic-quantization scenarios are added later, wouldn't it be enough to just extend the condition here, something like `args.cache_dtype in ["block_wise_fp8", **]`? The work involved is small.
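A minimal sketch of that suggestion (the function name and the dtype set are hypothetical, not the PR's actual code):

```python
# Sketch of the reviewers' suggestion: allocate the KV cache as uint8 for
# any dynamic-quantization dtype rather than special-casing block_wise_fp8.
# DYNAMIC_QUANT_DTYPES is illustrative; the project may gate this differently.
DYNAMIC_QUANT_DTYPES = {"block_wise_fp8"}  # extend as new schemes land


def resolve_cache_type(cache_dtype):
    """Return the storage dtype used when allocating the KV cache."""
    if cache_dtype in DYNAMIC_QUANT_DTYPES:
        # Quantized caches store raw bytes; scales live in separate tensors.
        return "uint8"
    # No dynamic quantization configured: keep the model's KV dtype,
    # e.g. "bfloat16".
    return cache_dtype or "bfloat16"
```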
Pull Request Overview
This PR adds support for dynamic FP8 quantization (block_wise_fp8) in prefix caching, enabling more memory-efficient KV cache storage with quantization scales. The implementation extends cache management to handle additional scale tensors required for FP8 quantization across GPU and CPU memory hierarchies.
Key Changes:
- Extended cache infrastructure to support block-wise FP8 quantization scales in addition to quantized cache data
- Modified cache swap operations to handle both cache data and scale tensors during GPU-CPU transfers
- Updated configuration to detect and apply block_wise_fp8 quantization type from model config
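As a mental model (not the PR's actual kernels), the sketch below shows block-wise quantization producing exactly the two artifacts the list above mentions: a uint8 cache tensor plus one float scale per block. int8 stands in for FP8, and all names are illustrative.

```python
import numpy as np


def quantize_block_wise(kv: np.ndarray, block_size: int):
    """Quantize a [num_tokens, head_dim] KV slab block by block.

    Returns the quantized cache bytes (stored as uint8) plus one scale
    per block -- the extra scale tensor this PR threads through the
    cache manager. int8 stands in for FP8 here; a real block_wise_fp8
    path would scale to the float8_e4m3 range instead.
    """
    num_blocks = kv.shape[0] // block_size
    blocks = kv[: num_blocks * block_size].reshape(num_blocks, block_size, -1)
    scales = np.abs(blocks).max(axis=(1, 2)) / 127.0  # one scale per block
    scales = np.maximum(scales, 1e-12)                # avoid division by zero
    quantized = np.rint(blocks / scales[:, None, None]).astype(np.int8)
    # Dequantize with: quantized.astype(np.float32) * scales[:, None, None]
    return quantized.view(np.uint8), scales.astype(np.float32)
```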
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/cache_manager/test_cache_transfer_manager.py | Added cache_dtype parameter to test Args class |
| fastdeploy/config.py | Added logic to read kv_cache_quant_type from model quantization config |
| fastdeploy/worker/gpu_model_runner.py | Extended KV cache initialization to create and share cache scale tensors for block_wise_fp8 |
| fastdeploy/cache_manager/cache_transfer_manager.py | Implemented cache scale tensor management including CPU/GPU allocation and swap operations |
| custom_ops/gpu_ops/swap_cache_batch.cu | Updated CUDA kernel to handle both 3D (scales) and 4D (cache) tensor shapes |
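The last row is where the scale tensors meet the data path: the batch-swap operation must now move both 4-D cache blocks and 3-D scale blocks. Below is a rough host-side Python sketch of that rank-based dispatch; the shapes and function names are assumptions for illustration, and the real change lives in the CUDA kernel in `custom_ops/gpu_ops/swap_cache_batch.cu`.

```python
def bytes_per_block(shape, itemsize):
    """Bytes in one transferable cache block, for either tensor rank.

    Assumed layouts (hypothetical): 4-D quantized cache
    [num_blocks, num_heads, block_size, head_dim] and 3-D scales
    [num_blocks, num_heads, block_size].
    """
    if len(shape) == 4:  # quantized cache data
        _, num_heads, block_size, head_dim = shape
        return num_heads * block_size * head_dim * itemsize
    if len(shape) == 3:  # per-block quantization scales
        _, num_heads, block_size = shape
        return num_heads * block_size * itemsize
    raise ValueError(f"unsupported cache tensor rank: {len(shape)}")


def swap_blocks(gpu_tensor, cpu_tensor, block_ids, direction):
    """Copy the listed blocks between GPU and CPU, cache or scales alike."""
    for block_id in block_ids:
        if direction == "gpu_to_cpu":
            cpu_tensor[block_id] = gpu_tensor[block_id]  # offload
        else:
            gpu_tensor[block_id] = cpu_tensor[block_id]  # reload
```

Note that host-side indexing by the leading block dimension is rank-agnostic; it is the raw CUDA kernel, which works on explicit strides, that needed the 3-D/4-D distinction.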
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@           Coverage Diff            @@
##           develop    #5125   +/-   ##
==========================================
  Coverage         ?   57.76%
==========================================
  Files            ?      317
  Lines            ?    38370
  Branches         ?     5742
==========================================
  Hits             ?    22166
  Misses           ?    14428
  Partials         ?     1776
```

Flags with carried forward coverage won't be shown.
carryyu left a comment:
LGTM
Motivation
Support prefix caching for the dynamic C8 (block_wise_fp8) quantized KV cache. Cherry-picked from #4918.
Modifications
Usage or Command
Usage matches the standard dynamic C8 launch procedure; prefix (context) caching is enabled by default, so no extra configuration is required.
Accuracy Tests
Checklist
- Add at least one of the following tags to the PR title: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- For the `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.