
CUDA: update build CTK version to 12.8 #13360

Open
thevishalagarwal wants to merge 3 commits into master from github-workflow/update-cuda-12.8

Conversation

thevishalagarwal
Contributor

thevishalagarwal commented May 7, 2025

Update the CUDA Toolkit version from 12.4 to 12.8 to support compilation of the real arch sm120 for Blackwell GPUs.

  • updated ggml-cuda/CMakeLists.txt to add compilation of the sm120 arch for Blackwell GPUs (a sketch follows below)
  • updated the CUDA Toolkit version from 12.4 to 12.8 for the Windows GitHub CI build
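
For illustration, the CMakeLists.txt side of the change amounts to roughly the following. This is a sketch only: the actual default architecture list and version checks in ggml-cuda/CMakeLists.txt may differ, and it assumes find_package(CUDAToolkit) has already run.

```cmake
# Sketch: gate the Blackwell real arch on CTK 12.8, the first toolkit release
# that can emit SASS for compute capability 12.0 (sm_120).
if (NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
    if (CUDAToolkit_VERSION VERSION_GREATER_EQUAL "12.8")
        # "120-real" adds precompiled machine code for RTX 50 series GPUs on top
        # of the pre-existing default entries (shown here as example values only).
        set(CMAKE_CUDA_ARCHITECTURES "50;61;70;75;80;120-real")
    else()
        set(CMAKE_CUDA_ARCHITECTURES "50;61;70;75;80")
    endif()
endif()
```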

github-actions bot added the Nvidia GPU, devops, and ggml labels on May 7, 2025
thevishalagarwal force-pushed the github-workflow/update-cuda-12.8 branch from e1db936 to c54c98f on May 12, 2025 11:53
@thevishalagarwal
Contributor Author

@JohannesGaessler @slaren @ggerganov ping for review

@slaren
Member

slaren commented May 14, 2025

I am not sure that we need to add real arch 120 to the build. The criterion for selecting which real archs to include is what we expect to be the most commonly used GPUs, to improve load time in those cases, but at this point there are likely very few people with RTX 50 series GPUs, below 1% according to the Steam hardware survey.
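
For context on the real/virtual distinction: in CMAKE_CUDA_ARCHITECTURES, an entry with a -real suffix ships precompiled SASS for that GPU (no JIT at load time, larger binary), a -virtual suffix ships PTX only (newer GPUs can still run it, but it is JIT-compiled when the program first loads), and a bare number ships both. A generic illustration, not the project's actual configuration:

```cmake
# Illustration only, not llama.cpp's actual list:
#   "80-virtual" -> PTX only: forward compatible, but JIT-compiled at load time
#   "86-real"    -> SASS only: fast load on that arch, no fallback for other archs
#   "89"         -> both SASS and PTX for that architecture
set(CMAKE_CUDA_ARCHITECTURES "80-virtual;86-real;89")
```

Adding 120-real would mainly shorten load time for RTX 50 series cards; other GPUs are unaffected aside from a somewhat larger binary.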

@thad0ctor

> I am not sure that we need to add real arch 120 to the build. The criterion for selecting which real archs to include is what we expect to be the most commonly used GPUs, to improve load time in those cases, but at this point there are likely very few people with RTX 50 series GPUs, below 1% according to the Steam hardware survey.

Why not? The percentage of people using Blackwell for AI, particularly 5090s and the RTX 6000, is probably disproportionately high compared to the overall Steam hardware survey.

thad0ctor added a commit to thad0ctor/llama.cpp that referenced this pull request Jul 2, 2025
- Add comprehensive 22-week implementation roadmap for Blackwell (compute capability 12.0)
- Include detailed technical specifications with code examples
- Focus on Flash Attention optimizations using Thread Block Clusters
- Plan leverages enhanced L2 cache (126 MB) and HBM3/HBM3e memory
- Build foundation already complete via PR ggml-org#13360 (CUDA 12.8 + sm120)
- Target 20-40% Flash Attention improvement over Ada Lovelace

Phase 1: Foundation and architecture detection (accelerated - complete)
Phase 2: Thread Block Clusters implementation
Phase 3: Flash Attention Blackwell optimizations
Phase 4-7: Advanced features, validation, and integration
@thad0ctor

Is there any reason this is still hanging out there?

@Thireus

Thireus commented Jul 27, 2025

Boycott? 🧐

@thad0ctor

> I am not sure that we need to add real arch 120 to the build. The criterion for selecting which real archs to include is what we expect to be the most commonly used GPUs, to improve load time in those cases, but at this point there are likely very few people with RTX 50 series GPUs, below 1% according to the Steam hardware survey.

@JohannesGaessler @slaren @ggerganov

This PR resolves t/s speed regressions introduced by commit 225e7a1. These speed regressions are described in #14881 and in #14795 (comment).

After investigating this issue, I found that the 4D FlashAttention changes in commit 225e7a1 have compile-time architecture dependencies that prevent them from working properly on RTX 5090/5xxx systems. Unlike previous versions, where a manual -DCMAKE_CUDA_ARCHITECTURES="120" override was sufficient, the new 4D implementation requires compute capability 12.0 to be explicitly included in the default CMakeLists.txt architecture configuration, or it falls back to compiling a lower-performance path. The 4D code contains compile-time checks and template instantiations that only generate the necessary kernel variants when 120-real is present in the default build targets; manual CMake overrides no longer work because the required code paths simply aren't compiled.

Is there a way we can fast-track the merging of this PR? Without this PR I get 40 t/s on Qwen3 235B q3; after merging this PR on my own I get >57 t/s. Thank you!

@thad0ctor

> Boycott? 🧐

Hopefully no longer. I just looked at the current Steam survey, and RTX 5xxx cards account for over 3.5% as of June: https://store.steampowered.com/hwsurvey/videocard/

This PR also fixes the speed issues for RTX 5xxx cards that were introduced mid-month; hopefully my last post helps move it along!
