
Conversation

JohannesGaessler
Collaborator

This PR refactors and deduplicates the CUDA "tile" FlashAttention kernels, adds support for head size 256, and improves performance on Pascal/AMD GPUs without FP16 mma. If fast FP16 math is available it is now used, but the KQ accumulation and the softmax are always done in FP32 precision (these seem to be the numerically problematic parts of the kernel). The kernel now has a more flexible parameterization; I tuned kq_stride and kq_nbatch for the P40/RX 6800/MI50. It is also possible to use a warp_size of 64 rather than 32, but I was not able to get better performance that way; I'm keeping the functionality since I may be overlooking something and may want to revisit it in the future.
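To make the precision split concrete, here is a minimal, hypothetical CUDA sketch (not the actual ggml kernel) of computing one tile of KQ scores: the K/Q inputs are FP16, but the dot-product accumulator and the online-softmax state are FP32. The `head_size`/`kq_stride` template parameters only loosely mirror the PR's parameterization and are assumptions for illustration.

```cuda
// Hypothetical sketch of the FP16-input / FP32-accumulation split described above.
// Not the ggml/llama.cpp implementation; names are illustrative only.
#include <cuda_fp16.h>
#include <math.h>

template <int head_size, int kq_stride>
__global__ void flash_attn_tile_sketch(
        const half  * __restrict__ Q,   // [n_q,  head_size], FP16 inputs
        const half  * __restrict__ K,   // [n_kv, head_size], FP16 inputs
        float       * __restrict__ KQ,  // [n_q,  kq_stride], unnormalized scores for one tile
        const int n_kv, const float scale) {
    const int iq  = blockIdx.x;   // one query row per block (sketch only)
    const int tid = threadIdx.x;

    // Running softmax state kept in FP32 - the numerically sensitive part.
    float kq_max = -INFINITY;
    float kq_sum = 0.0f;

    for (int ik = tid; ik < min(n_kv, kq_stride); ik += blockDim.x) {
        // KQ accumulation in FP32 even though the inputs are FP16.
        float sum = 0.0f;
        for (int d = 0; d < head_size; ++d) {
            sum += __half2float(Q[iq*head_size + d]) * __half2float(K[ik*head_size + d]);
        }
        sum *= scale;

        // Online softmax update, also in FP32.
        const float kq_max_new = fmaxf(kq_max, sum);
        kq_sum = kq_sum*expf(kq_max - kq_max_new) + expf(sum - kq_max_new);
        kq_max = kq_max_new;

        KQ[(size_t) iq*kq_stride + ik] = sum;
    }
    // A full kernel would reduce kq_max/kq_sum across the warp/block and use
    // them to rescale the accumulated softmax(KQ)*V output; omitted here.
}
```

Keeping only the KQ accumulation and the softmax in FP32 lets the rest of the kernel use fast FP16 math where the hardware provides it, without reintroducing the numerical issues mentioned above.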

Performance changes
| GPU | Model | Microbatch size | Test | t/s master | t/s c9a318a | Speedup |
|---|---|---|---|---|---|---|
| MI50 | llama 1B Q4_0 | 16 | pp16384 | 188.69 | 682.28 | 3.62 |
| MI50 | llama 1B Q4_0 | 32 | pp16384 | 184.60 | 956.19 | 5.18 |
| MI50 | llama 1B Q4_0 | 512 | pp16384 | 185.61 | 1510.04 | 8.14 |
| MI50 | llama 8B Q4_0 | 16 | pp16384 | 37.08 | 194.09 | 5.23 |
| MI50 | llama 8B Q4_0 | 32 | pp16384 | 38.41 | 163.29 | 4.25 |
| MI50 | llama 8B Q4_0 | 512 | pp16384 | 38.35 | 203.44 | 5.31 |
| RX 6800 | llama 1B Q4_0 | 16 | pp16384 | 378.27 | 645.38 | 1.71 |
| RX 6800 | llama 1B Q4_0 | 32 | pp16384 | 275.62 | 659.44 | 2.39 |
| RX 6800 | llama 1B Q4_0 | 512 | pp16384 | 332.24 | 898.23 | 2.70 |
| RX 6800 | llama 8B Q4_0 | 16 | pp16384 | 55.59 | 172.44 | 3.10 |
| RX 6800 | llama 8B Q4_0 | 32 | pp16384 | 71.36 | 174.42 | 2.44 |
| RX 6800 | llama 8B Q4_0 | 512 | pp16384 | 84.39 | 231.21 | 2.74 |
| P40 | llama 1B Q4_0 | 16 | pp16384 | 1109.57 | 1189.42 | 1.07 |
| P40 | llama 1B Q4_0 | 32 | pp16384 | 1500.00 | 1686.71 | 1.12 |
| P40 | llama 1B Q4_0 | 512 | pp16384 | 2052.54 | 2449.76 | 1.19 |
| P40 | llama 8B Q4_0 | 16 | pp16384 | 279.17 | 296.43 | 1.06 |
| P40 | llama 8B Q4_0 | 32 | pp16384 | 357.32 | 356.78 | 1.00 |
| P40 | llama 8B Q4_0 | 512 | pp16384 | 497.47 | 499.94 | 1.00 |

@JohannesGaessler
Collaborator Author

FA on vs. off
| GPU | model | n_ubatch | fa | test | t/s |
|---|---|---|---|---|---|
| MI50 | llama 1B Q4_0 | 16 | 0 | pp16384 | 536.59 |
| MI50 | llama 1B Q4_0 | 16 | 1 | pp16384 | 680.67 |
| MI50 | llama 1B Q4_0 | 32 | 0 | pp16384 | 828.74 |
| MI50 | llama 1B Q4_0 | 32 | 1 | pp16384 | 955.58 |
| MI50 | llama 1B Q4_0 | 512 | 0 | pp16384 | 1633.99 |
| MI50 | llama 1B Q4_0 | 512 | 1 | pp16384 | 1508.16 |
| MI50 | llama 8B Q4_0 | 16 | 0 | pp16384 | 119.80 |
| MI50 | llama 8B Q4_0 | 16 | 1 | pp16384 | 193.42 |
| MI50 | llama 8B Q4_0 | 32 | 0 | pp16384 | 184.17 |
| MI50 | llama 8B Q4_0 | 32 | 1 | pp16384 | 163.32 |
| MI50 | llama 8B Q4_0 | 512 | 0 | pp16384 | 340.08 |
| MI50 | llama 8B Q4_0 | 512 | 1 | pp16384 | 202.60 |
| MI50 | gemma 2B Q4_0 | 16 | 0 | pp16384 | 522.88 |
| MI50 | gemma 2B Q4_0 | 16 | 1 | pp16384 | 330.48 |
| MI50 | gemma 2B Q4_0 | 32 | 0 | pp16384 | 754.68 |
| MI50 | gemma 2B Q4_0 | 32 | 1 | pp16384 | 313.09 |
| MI50 | gemma 2B Q4_0 | 512 | 0 | pp16384 | 1937.73 |
| MI50 | gemma 2B Q4_0 | 512 | 1 | pp16384 | 404.83 |
| P40 | llama 1B Q4_0 | 16 | 0 | pp16384 | 194.19 |
| P40 | llama 1B Q4_0 | 16 | 1 | pp16384 | 1190.37 |
| P40 | llama 1B Q4_0 | 32 | 0 | pp16384 | 408.35 |
| P40 | llama 1B Q4_0 | 32 | 1 | pp16384 | 1685.89 |
| P40 | llama 1B Q4_0 | 512 | 0 | pp16384 | 1278.13 |
| P40 | llama 1B Q4_0 | 512 | 1 | pp16384 | 2459.82 |
| P40 | llama 8B Q4_0 | 16 | 0 | pp16384 | 69.48 |
| P40 | llama 8B Q4_0 | 16 | 1 | pp16384 | 296.45 |
| P40 | llama 8B Q4_0 | 32 | 0 | pp16384 | 132.42 |
| P40 | llama 8B Q4_0 | 32 | 1 | pp16384 | 356.64 |
| P40 | llama 8B Q4_0 | 512 | 0 | pp16384 | 426.63 |
| P40 | llama 8B Q4_0 | 512 | 1 | pp16384 | 499.74 |
| P40 | gemma 2B Q4_0 | 16 | 0 | pp16384 | 307.76 |
| P40 | gemma 2B Q4_0 | 16 | 1 | pp16384 | 791.60 |
| P40 | gemma 2B Q4_0 | 32 | 0 | pp16384 | 564.02 |
| P40 | gemma 2B Q4_0 | 32 | 1 | pp16384 | 1061.81 |
| P40 | gemma 2B Q4_0 | 512 | 0 | pp16384 | 1715.35 |
| P40 | gemma 2B Q4_0 | 512 | 1 | pp16384 | 1347.84 |
| RX 6800 | llama 1B Q4_0 | 16 | 0 | pp16384 | 444.98 |
| RX 6800 | llama 1B Q4_0 | 16 | 1 | pp16384 | 645.99 |
| RX 6800 | llama 1B Q4_0 | 32 | 0 | pp16384 | 687.48 |
| RX 6800 | llama 1B Q4_0 | 32 | 1 | pp16384 | 659.64 |
| RX 6800 | llama 1B Q4_0 | 512 | 0 | pp16384 | 1060.44 |
| RX 6800 | llama 1B Q4_0 | 512 | 1 | pp16384 | 898.50 |
| RX 6800 | llama 8B Q4_0 | 16 | 0 | pp16384 | 90.17 |
| RX 6800 | llama 8B Q4_0 | 16 | 1 | pp16384 | 172.26 |
| RX 6800 | llama 8B Q4_0 | 32 | 0 | pp16384 | 145.16 |
| RX 6800 | llama 8B Q4_0 | 32 | 1 | pp16384 | 174.27 |
| RX 6800 | llama 8B Q4_0 | 512 | 0 | pp16384 | 240.45 |
| RX 6800 | llama 8B Q4_0 | 512 | 1 | pp16384 | 231.19 |
| RX 6800 | gemma 2B Q4_0 | 16 | 0 | pp16384 | 480.66 |
| RX 6800 | gemma 2B Q4_0 | 16 | 1 | pp16384 | 399.05 |
| RX 6800 | gemma 2B Q4_0 | 32 | 0 | pp16384 | 738.28 |
| RX 6800 | gemma 2B Q4_0 | 32 | 1 | pp16384 | 313.55 |
| RX 6800 | gemma 2B Q4_0 | 512 | 0 | pp16384 | 1275.12 |
| RX 6800 | gemma 2B Q4_0 | 512 | 1 | pp16384 | 568.60 |

github-actions bot added the "Nvidia GPU" (Issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) labels Sep 3, 2025
slaren requested a review from IMbackK September 6, 2025 14:09
@IMbackK
Collaborator

IMbackK commented Sep 6, 2025

I'm currently traveling and won't be able to look at anything until the 13th.

@JohannesGaessler merged commit 79bc429 into ggml-org:master Sep 6, 2025
48 checks passed
@Dampfinchen

Hmm. I was excited for the head size 256 support, but Flash Attention + partial offloading + quantized KV cache still destroys prompt processing performance for Gemma 3 12B (but not 27B).

@JohannesGaessler
Collaborator Author

The support for the combination of head size 256 + quantized KV cache still has other issues and requires a refactor of the "vector" kernels.

walidbr pushed a commit to walidbr/llama.cpp that referenced this pull request Sep 7, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Sep 9, 2025
njsyw1997 pushed a commit to aizip/llama.cpp that referenced this pull request Sep 10, 2025