Releases: ngxson/llama.cpp
b6075
b6074
CUDA: use mma FA kernel for gqa > 4 on RTX 4000 (#15035)
b6073
cuda: make im2col a little faster (#15025)
b6071
llama : enable LLAMA_SET_ROWS=1 by default (#14959) ggml-ci
b6067
chat : fix multiple tool_calls on hermes-2-pro (#14962)
b6066
vulkan: coopmat2 mul_mat optimizations (#14934)
- Increase tile size for k-quants, to match non-k-quants
- Choose more carefully between large and medium tiles, considering how it interacts with split_k
- Allow larger/non-power-of-two split_k, and make the splits a multiple of 256
- Use split_k == 3 when >1/2 and <=2/3 of the SMs would have been used
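The split_k rule in the commit message above can be sketched as follows. This is an illustrative reconstruction, not the actual shader dispatch code: the function name, the `num_tiles`/`num_sms` inputs, and every branch other than the documented split_k == 3 case are assumptions.

```python
def choose_split_k(num_tiles: int, num_sms: int) -> int:
    """Illustrative sketch of the split_k heuristic from #14934.

    Idea: if an unsplit launch would keep only part of the GPU busy,
    split the K dimension so more workgroups can run concurrently.
    Only the split_k == 3 band (>1/2 and <=2/3 of SMs occupied) is
    stated in the commit message; the other branches are placeholders.
    """
    occupancy = num_tiles / num_sms
    if occupancy > 1.0:
        return 1  # GPU already full without splitting
    if occupancy > 2 / 3:
        return 2  # placeholder: modest split
    if occupancy > 1 / 2:
        return 3  # the documented non-power-of-two case
    return 4      # placeholder: low occupancy, split more aggressively

print(choose_split_k(40, 68))  # ~0.59 of SMs used unsplit -> 3
```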
b6064
vulkan: Support ne[3]>1 in noncontig matrix-vector multiply (#15015)
b6063
model : support Qwen3-Embedding (#15023)
b6062
server: enable token array inputs for OAI API (#15001)
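A minimal sketch of what the change in #15001 enables: the server's OpenAI-compatible completions endpoint can accept a pre-tokenized prompt (an array of token ids) in place of a string. The payload below is constructed offline; the token ids are placeholders, not real vocabulary ids, and the localhost URL in the comment assumes a locally running llama-server.

```python
import json

# Completion request with a token array instead of a text prompt.
# The ids here are placeholders for illustration only.
payload = {
    "prompt": [1, 15043, 3186],  # token ids, not a string
    "max_tokens": 16,
}

body = json.dumps(payload)
print(body)
# Sending it would look roughly like (assuming llama-server on port 8080):
#   curl http://localhost:8080/v1/completions \
#        -H "Content-Type: application/json" -d "$body"
```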
b6061
vulkan: optimizations for direct convolution (#14933)
* vulkan: optimizations for direct convolution
  - Empirically choose a better tile size. Reducing BS_K/BS_NPQ helps fill the GPU. The new size should be amenable to using coopmat, too.
  - Fix shmem bank conflicts. 16B padding should work with coopmat.
  - Some explicit loop unrolling.
  - Skip math/stores work for parts of the tile that are OOB.
  - Apply fastdiv opt.
  - Disable shuffles for NV.
* Three tile sizes for CONV_2D, and a heuristic to choose between them
* reallow collectives for pre-Turing
* make SHMEM_PAD a spec constant
* fixes for Intel perf - no shmem padding, placeholder shader core count
* shader variants with/without unrolling
* 0cc4m's fixes for AMD perf

Co-authored-by: 0cc4m <[email protected]>