[WIP] gemm block quantization for llm decoder style #6439
base: master
Conversation
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #6439      +/-   ##
==========================================
+ Coverage   95.62%   95.88%   +0.26%
==========================================
  Files         844      844
  Lines      266761   266834      +73
==========================================
+ Hits       255080   255859     +779
+ Misses      11681    10975     -706
```
The binary size change of libncnn.so (bytes)
Pull request overview
This WIP pull request implements block quantization for GEMM layers to support 4-bit, 6-bit, and 8-bit quantization for LLM decoder-style models. The changes introduce a new quantization tool and corresponding dequantization logic in the GEMM layer implementation.
Key changes:
- New `ncnnllm2int468` tool for quantizing GEMM weight matrices with configurable block sizes and bit widths
- Block-based quantization scheme using per-block scaling factors stored in `B_data_quantize_scales` (sketched below)
- Dequantization logic in `gemm.cpp` that converts quantized weights back to fp32 during model loading
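The quoted diffs below don't show the scheme end to end, so here is a minimal sketch of per-block symmetric quantization for the 8-bit case. The function names, clamping, and round-to-nearest choice are assumptions for illustration, not the PR's exact code:

```cpp
#include <algorithm>
#include <cmath>

// Quantize one block of block_size fp32 weights with a single shared scale
// (absmax / 127 for int8); the per-block scales are what would end up in
// B_data_quantize_scales.
void quantize_block_int8(const float* w, signed char* q, float& scale, int block_size)
{
    float absmax = 0.f;
    for (int i = 0; i < block_size; i++)
        absmax = std::max(absmax, std::fabs(w[i]));

    scale = absmax / 127.f;
    const float inv_scale = scale == 0.f ? 0.f : 1.f / scale;

    for (int i = 0; i < block_size; i++)
    {
        int v = (int)std::lround(w[i] * inv_scale);
        q[i] = (signed char)std::min(std::max(v, -127), 127); // clamp for safety
    }
}

// Dequantization at model load: multiply back by the block's scale.
void dequantize_block_int8(const signed char* q, float scale, float* w, int block_size)
{
    for (int i = 0; i < block_size; i++)
        w[i] = q[i] * scale;
}
```

For 4-bit and 6-bit the same idea applies with qmax = 7 and 31 respectively, plus the bit-packing shown in the review comments below.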
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| tools/quantize/ncnnllm2int468.cpp | New quantization tool implementing 4/6/8-bit block quantization with custom bit-packed storage formats |
| tools/quantize/CMakeLists.txt | Build configuration adding the new ncnnllm2int468 executable |
| tools/modelwriter.h | Extended serialization to save block quantization scales for int8_scale_term values 4/5/6 |
| src/layer/gemm.h | Added B_data_quantize_scales member to store per-block scaling factors |
| src/layer/gemm.cpp | Implemented loading and dequantization logic for 4/6/8-bit block-quantized weights |
```cmake
add_executable(ncnnllm2int468 ncnnllm2int468.cpp)
target_link_libraries(ncnnllm2int468 PRIVATE ncnn)
```
Copilot AI · Dec 4, 2025
The new ncnnllm2int468 executable is not added to the virtual project group or installed via ncnn_install_tool(), unlike ncnn2int8 above. This creates inconsistency in how tools are organized and installed.
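For reference, a sketch of what aligning with the neighboring tools might look like; `ncnn_install_tool()` is taken from the reviewer's comment, and the FOLDER value is a guess that should be copied from the existing ncnn2int8 entry:

```cmake
add_executable(ncnnllm2int468 ncnnllm2int468.cpp)
target_link_libraries(ncnnllm2int468 PRIVATE ncnn)

# group with the other quantize tools in IDE project views
set_property(TARGET ncnnllm2int468 PROPERTY FOLDER "tools/quantize")

# install alongside ncnn2table / ncnn2int8
ncnn_install_tool(ncnnllm2int468)
```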
```cpp
union i6x4_t
{
    signed char i6[3];
    struct
    {
        signed char i6_a : 6;
        signed char i6_b : 6;
        signed char i6_c : 6;
        signed char i6_d : 6;
    } __attribute__((packed));
};
```
Copilot AI · Dec 4, 2025
The i6x4_t union definition is duplicated in both the quantization tool and the dequantization code in gemm.cpp (lines 193-203). Consider moving this to a shared header to maintain consistency and avoid duplication.
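For readers unfamiliar with the layout: the union overlays four signed 6-bit fields on 3 bytes (24 bits). A shift-based codec like the following sketch (helper names invented here) expresses the same packing without bitfields, which would also address the portability comment below; the bit order GCC actually assigns to the bitfields should be verified before swapping one for the other:

```cpp
#include <cstdint>

// Pack four signed 6-bit values (each in [-32, 31]) into 3 bytes, LSB-first.
void pack_i6x4(const signed char v[4], unsigned char out[3])
{
    uint32_t bits = 0;
    for (int i = 0; i < 4; i++)
        bits |= (uint32_t)(v[i] & 0x3f) << (6 * i);

    out[0] = (unsigned char)(bits & 0xff);
    out[1] = (unsigned char)((bits >> 8) & 0xff);
    out[2] = (unsigned char)((bits >> 16) & 0xff);
}

// Unpack, sign-extending each 6-bit field.
void unpack_i6x4(const unsigned char in[3], signed char v[4])
{
    uint32_t bits = in[0] | ((uint32_t)in[1] << 8) | ((uint32_t)in[2] << 16);
    for (int i = 0; i < 4; i++)
    {
        int x = (bits >> (6 * i)) & 0x3f;
        v[i] = (signed char)((x ^ 0x20) - 0x20); // sign-extend from 6 bits
    }
}
```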
```cpp
{
    signed char i4_low : 4;
    signed char i4_high : 4;
} __attribute__((packed));
```
Copilot AI · Dec 4, 2025
The __attribute__((packed)) attribute is GCC-specific and not portable. This will fail on MSVC. Consider using #pragma pack for cross-platform compatibility or conditionally compile based on compiler.
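If the bitfield unions are kept, a conditional wrapper along these lines is one option (NCNN_PACKED is a made-up macro name; note also that MSVC is not guaranteed to lay out the bitfields identically to GCC, so the shift-based approach sketched above is the more robust fix):

```cpp
#if defined(_MSC_VER)
#pragma pack(push, 1)
#define NCNN_PACKED // rely on #pragma pack on MSVC
#else
#define NCNN_PACKED __attribute__((packed))
#endif

union i4x2_t
{
    signed char i4;
    struct
    {
        signed char i4_low : 4;
        signed char i4_high : 4;
    } NCNN_PACKED;
};

#if defined(_MSC_VER)
#pragma pack(pop)
#endif
```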
```cpp
union i4x2_t
{
    signed char i4;
    struct
    {
        signed char i4_low : 4;
        signed char i4_high : 4;
    } __attribute__((packed));
};
```
Copilot AI · Dec 4, 2025
The i4x2_t union definition is duplicated in both the quantization tool and the dequantization code in gemm.cpp (lines 264-272). Consider moving this to a shared header to maintain consistency and avoid duplication.
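To make the shared-usage point concrete, the consuming side looks roughly like this: each byte yields two weights, both multiplied by their block's scale (a sketch built on the union above; the actual loop in gemm.cpp may differ):

```cpp
#include <cstddef>
#include <vector>

// Dequantize int4 weights (two per i4x2_t byte) back to fp32, using one
// scale per block_size weights, as stored in B_data_quantize_scales.
std::vector<float> dequantize_int4(const std::vector<i4x2_t>& packed,
                                   const std::vector<float>& B_data_quantize_scales,
                                   int block_size)
{
    std::vector<float> w(packed.size() * 2);
    for (std::size_t j = 0; j < packed.size(); j++)
    {
        const float scale = B_data_quantize_scales[(j * 2) / block_size];
        w[j * 2 + 0] = packed[j].i4_low * scale;  // low nibble
        w[j * 2 + 1] = packed[j].i4_high * scale; // high nibble
    }
    return w;
}
```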
```cpp
const int block_size = 64; // FIXME hardcode
// const int nbits = 8; // FIXME hardcode
const int nbits = 6; // FIXME hardcode
```
Copilot AI · Dec 4, 2025
The block_size value of 64 is hardcoded here and also duplicated in gemm.cpp (lines 143, 183, 254). Consider making it a named constant or passing it as a configurable parameter to avoid inconsistencies if this value needs to change.
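A minimal sketch of the suggested fix, hoisting the constant into one shared header so the tool and gemm.cpp cannot drift apart (the header path and constant name are hypothetical):

```cpp
// e.g. src/layer/gemm_blockquant.h (name assumed), included by both
// tools/quantize/ncnnllm2int468.cpp and src/layer/gemm.cpp
#ifndef NCNN_GEMM_BLOCKQUANT_H
#define NCNN_GEMM_BLOCKQUANT_H

namespace ncnn {

// one definition instead of the literal 64 repeated in four places;
// the i4x2_t / i6x4_t unions could move here as well
static const int GEMM_QUANT_BLOCK_SIZE = 64;

} // namespace ncnn

#endif // NCNN_GEMM_BLOCKQUANT_H
```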
Co-authored-by: Copilot <[email protected]>