
CUDA: update build CTK version to 12.8 #13360

Open
thevishalagarwal wants to merge 3 commits into master from github-workflow/update-cuda-12.8

Conversation

thevishalagarwal
Contributor

thevishalagarwal commented May 7, 2025

Update the CUDA Toolkit version from 12.4 to 12.8 to support compilation of the real arch sm120 for Blackwell GPUs.

  • updated ggml-cuda/CMakeLists.txt to add compilation of the sm120 arch for Blackwell GPUs (a sketch follows below)
  • updated the CUDA Toolkit version from 12.4 to 12.8 for the Windows GitHub CI build
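
For illustration, the CMakeLists.txt side of the change amounts to roughly the following. This is a sketch only: the actual default architecture list and version checks in ggml-cuda/CMakeLists.txt may differ, and it assumes find_package(CUDAToolkit) has already run.

```cmake
# Sketch: gate the Blackwell real arch on CTK 12.8, the first toolkit release
# that can emit SASS for compute capability 12.0 (sm_120).
if (NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
    if (CUDAToolkit_VERSION VERSION_GREATER_EQUAL "12.8")
        # "120-real" adds precompiled machine code for RTX 50 series GPUs on top
        # of the pre-existing default entries (shown here as example values only).
        set(CMAKE_CUDA_ARCHITECTURES "50;61;70;75;80;120-real")
    else()
        set(CMAKE_CUDA_ARCHITECTURES "50;61;70;75;80")
    endif()
endif()
```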

github-actions bot added the Nvidia GPU, devops, and ggml labels on May 7, 2025
thevishalagarwal force-pushed the github-workflow/update-cuda-12.8 branch from e1db936 to c54c98f on May 12, 2025 11:53
@thevishalagarwal
Contributor Author

@JohannesGaessler @slaren @ggerganov ping for review

@slaren
Member

slaren commented May 14, 2025

I am not sure that we need to add real arch 120 to the build. The criterion for selecting which real archs to include is what we expect to be the most commonly used GPUs, to improve load time in those cases, but at this point there are likely very few people with RTX 50 series GPUs, below 1% according to the Steam hardware survey.
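
For context on the real/virtual distinction: in CMAKE_CUDA_ARCHITECTURES, an entry with a -real suffix ships precompiled SASS for that GPU (no JIT at load time, larger binary), a -virtual suffix ships PTX only (newer GPUs can still run it, but it is JIT-compiled when the program first loads), and a bare number ships both. A generic illustration, not the project's actual configuration:

```cmake
# Illustration only, not llama.cpp's actual list:
#   "80-virtual" -> PTX only: forward compatible, but JIT-compiled at load time
#   "86-real"    -> SASS only: fast load on that arch, no fallback for other archs
#   "89"         -> both SASS and PTX for that architecture
set(CMAKE_CUDA_ARCHITECTURES "80-virtual;86-real;89")
```

Adding 120-real would mainly shorten load time for RTX 50 series cards; other GPUs are unaffected aside from a somewhat larger binary.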

@thad0ctor

> I am not sure that we need to add real arch 120 to the build. The criterion for selecting which real archs to include is what we expect to be the most commonly used GPUs, to improve load time in those cases, but at this point there are likely very few people with RTX 50 series GPUs, below 1% according to the Steam hardware survey.

Why not? The percentage of people using Blackwell for AI, particularly 5090s and the RTX 6000, is probably disproportionately high compared to the overall Steam hardware survey.

thad0ctor added a commit to thad0ctor/llama.cpp that referenced this pull request Jul 2, 2025
- Add comprehensive 22-week implementation roadmap for Blackwell (compute capability 12.0)
- Include detailed technical specifications with code examples
- Focus on Flash Attention optimizations using Thread Block Clusters
- Plan leverages enhanced L2 cache (126 MB) and HBM3/HBM3e memory
- Build foundation already complete via PR ggml-org#13360 (CUDA 12.8 + sm120)
- Target 20-40% Flash Attention improvement over Ada Lovelace

Phase 1: Foundation and architecture detection (accelerated - complete)
Phase 2: Thread Block Clusters implementation
Phase 3: Flash Attention Blackwell optimizations
Phase 4-7: Advanced features, validation, and integration
@thad0ctor

Is there any reason this is still hanging out there?

@Thireus

Thireus commented Jul 27, 2025

Boycott? 🧐

@thad0ctor

> I am not sure that we need to add real arch 120 to the build. The criterion for selecting which real archs to include is what we expect to be the most commonly used GPUs, to improve load time in those cases, but at this point there are likely very few people with RTX 50 series GPUs, below 1% according to the Steam hardware survey.

@JohannesGaessler @slaren @ggerganov

This PR resolves t/s speed regressions introduced by commit 225e7a1. These speed regressions are described in #14881 and in #14795 (comment).

After investigating this issue, I found that the 4D FlashAttention changes in commit 225e7a1 have compile-time architecture dependencies that prevent them from working properly on RTX 5090/5xxx systems. Unlike previous versions, where a manual -DCMAKE_CUDA_ARCHITECTURES="120" override was sufficient, the new 4D implementation requires compute capability 12.0 to be explicitly included in the default CMakeLists.txt architecture configuration, or it falls back to compiling a lower-performance path. The 4D code contains compile-time checks and template instantiations that only generate the necessary kernel variants when 120-real is present in the default build targets; manual CMake overrides no longer work because the required code paths simply aren't compiled.

Is there a way we can fast-track the merging of this PR? Without this PR I get 40 t/s on Qwen3 235B q3; after merging this PR on my own I get >57 t/s. Thank you!

@thad0ctor

> Boycott? 🧐

Hopefully no longer. I just looked at the current Steam survey, and RTX 5xxx cards account for over 3.5% as of June: https://store.steampowered.com/hwsurvey/videocard/

This PR also fixes the speed issues for RTX 5xxx cards that were introduced mid-month; hopefully my last post helps move it along!
