CUDA: update build CTK version to 12.8 #13360
base: master
Conversation
Compare: e1db936 to c54c98f
@JohannesGaessler @slaren @ggerganov ping for review
I am not sure that we need to add real arch 120 to the build. The criterion for selecting which real archs to include is what we expect to be the most commonly used GPUs, to improve load time in those cases, but at this point there are likely very few people with RTX 50 series GPUs, below 1% according to the Steam hardware survey.
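For context on the trade-off being discussed: nvcc can embed either a "real" architecture binary (SASS, ready to run on that GPU) or a "virtual" architecture (PTX, which the driver JIT-compiles at load time, slowing first load on GPUs without a prebuilt binary). A hedged sketch of the difference using nvcc's standard `-gencode` syntax (the exact flags the llama.cpp build emits may differ):

```
# Real arch: ship SASS for sm_120; fast load on RTX 50 series, larger binary.
nvcc -gencode arch=compute_120,code=sm_120 ...

# Virtual arch only: ship PTX for compute_120; smaller binary, but the
# driver JIT-compiles it on first load, increasing load time.
nvcc -gencode arch=compute_120,code=compute_120 ...
```

This is why the real-arch list is kept to the most common GPUs: each added real arch grows the binary, while omitting one only costs a one-time JIT compile on that hardware.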
Why not? The percentage of people using Blackwell for AI, particularly 5090s and RTX 6000s, is probably disproportionate to the overall Steam hardware survey. |
- Add comprehensive 22-week implementation roadmap for Blackwell (compute capability 12.0)
- Include detailed technical specifications with code examples
- Focus on Flash Attention optimizations using Thread Block Clusters
- Plan leverages enhanced L2 cache (126 MB) and HBM3/HBM3e memory
- Build foundation already complete via PR ggml-org#13360 (CUDA 12.8 + sm120)
- Target 20-40% Flash Attention improvement over Ada Lovelace

Phase 1: Foundation and architecture detection (accelerated - complete)
Phase 2: Thread Block Clusters implementation
Phase 3: Flash Attention Blackwell optimizations
Phase 4-7: Advanced features, validation, and integration
Is there any reason this is still hanging out there?
Boycott? 🧐
@JohannesGaessler @slaren @ggerganov This commit resolves t/s speed regressions introduced by commit 225e7a1. These speed regressions are described here: #14881 and here: #14795 (comment).

After investigating this issue: the 4D FlashAttention changes in commit 225e7a1 have compile-time architecture dependencies that prevent them from working properly on RTX 5090/5xxxx systems, unlike previous versions.

Is there a way we can fast-track the merging of this PR? Without this PR I get 40 t/s on Qwen3 235B q3; after merging this PR on my own I get >57 t/s. Thank you!
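For scale, the commenter's own numbers (40 t/s before the fix, >57 t/s after; a user report, not an official benchmark) work out to roughly a 42% throughput improvement:

```python
# Relative throughput change from the figures reported above.
# These are the commenter's measurements on Qwen3 235B q3, not an
# official llama.cpp benchmark.
before_tps = 40.0
after_tps = 57.0

speedup = after_tps / before_tps          # 1.425x
improvement_pct = (speedup - 1.0) * 100   # 42.5%
print(f"{speedup:.3f}x ({improvement_pct:.1f}% faster)")
```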
Hopefully no longer; I just looked at the current Steam survey and RTX 5xxx cards account for over 3.5% as of June: https://store.steampowered.com/hwsurvey/videocard/ This PR also fixes speed issues for RTX 5xxx cards introduced mid-month; hopefully my last post helps move it along!
Update the CUDA Toolkit version from 12.4 to 12.8 to support compilation of the real arch sm120 for Blackwell GPUs.
ggml-cuda/CMakeLists.txt: add compilation of the sm120 arch for Blackwell GPUs
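As an illustration of what the change amounts to, a minimal CMake sketch of gating the Blackwell arch on toolkit version, assuming the standard `CMAKE_CUDA_ARCHITECTURES` mechanism (the actual logic in ggml-cuda/CMakeLists.txt is more involved; this is not a copy of it):

```cmake
# Compute capability 12.0 (sm120) requires CUDA Toolkit >= 12.8 and
# CMake >= 3.18 for CMAKE_CUDA_ARCHITECTURES support.
# A bare "120" emits a real (SASS) binary for Blackwell; "120-virtual"
# would embed only PTX, trading binary size for JIT time at model load.
if (CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL "12.8")
    list(APPEND CMAKE_CUDA_ARCHITECTURES 120)
endif()
```

Older toolkits simply skip the entry, so builds against CUDA 12.4 remain unaffected.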