
Conversation


vinovo commented Sep 5, 2025

Make sure to read the contributing guidelines before submitting a PR

reeselevine and others added 30 commits August 5, 2025 16:26
* Add parameter buffer pool, batching of submissions, refactor command building/submission

* Add header for linux builds

* Free staged parameter buffers at once

* Format with clang-format

* Fix thread-safe implementation

* Use device implicit synchronization

* Update workflow to use custom release

* Remove testing branch workflow

* Disable set_rows until it's implemented

* Fix potential issue around empty queue submission

* Try synchronous submission

* Try waiting on all futures explicitly

* Add debug

* Add more debug messages

* Work on getting ssh access for debugging

* Debug on failure

* Disable other tests

* Remove extra if

* Try more locking

* maybe passes?

* test

* Some cleanups

* Restore build file

* Remove extra testing branch ci
* feat(cann): add optional support for ACL Graph execution

This commit adds support for executing ggml computational graphs using
Huawei's ACL graph mode via the USE_CANN_GRAPH flag. The support can be
enabled at compile time using the CMake option:

    -DUSE_CANN_GRAPH=ON

By default, ACL graph execution is **disabled**, and the fallback path
uses node-by-node execution.

Key additions:
- CMake option `USE_CANN_GRAPH` to toggle graph mode
- Graph capture and execution logic using the ACL graph API
- Tensor property matching to determine whether a graph update is required
  (see the sketch below)
- Safe fallback and logging if the environment variable LLAMA_SET_ROWS
  is unset or invalid

This prepares the backend for performance improvements in repetitive graph
execution scenarios on Ascend devices.

Signed-off-by: noemotiovon <[email protected]>
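
The "tensor property matching" mentioned above is essentially a cache keyed on node properties: re-capture the ACL graph only when an operator, shape, or dtype changes, and otherwise replay the previously captured graph. A minimal Python sketch of that idea (illustrative only; the real logic lives in the CANN backend's C++ code and works on ggml tensor metadata):

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class NodeSignature:
    """Tensor properties that must match for a captured graph to be reused."""
    op: str
    shape: tuple
    dtype: str


def graph_signature(nodes: Sequence[dict]) -> tuple:
    """Build an immutable signature for the whole computational graph."""
    return tuple(
        NodeSignature(n["op"], tuple(n["shape"]), n["dtype"]) for n in nodes
    )


class CapturedGraphCache:
    """Re-capture the graph only when the node signature changes."""

    def __init__(self) -> None:
        self._signature: tuple | None = None
        self._graph = None

    def get_or_capture(self, nodes: Sequence[dict], capture: Callable):
        sig = graph_signature(nodes)
        if sig != self._signature:
            self._graph = capture(nodes)   # expensive: capture and compile
            self._signature = sig
        return self._graph                 # cheap: replay the captured graph
```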

* Fix review comments

Signed-off-by: noemotiovon <[email protected]>

* rename USE_CANN_GRAPH to USE_ACL_GRAPH

Signed-off-by: noemotiovon <[email protected]>

* fix typo

Signed-off-by: noemotiovon <[email protected]>

---------

Signed-off-by: noemotiovon <[email protected]>
* opencl: add `swiglu-oai`

* opencl: add `add_id`

* opencl: add missing `add_id.cl`
* Begin work on set_rows

* Work on set rows

* Add error buffers for reporting unsupported SET_ROWS indices

* Remove extra comments
This commit addresses an issue with the convert_hf_to_gguf script
which is currently failing with:
```console
AttributeError: module 'torch' has no attribute 'uint64'
```

This occurred because safetensors expects torch.uint64 to be available
in the public API, but PyTorch 2.2.x appears to provide only limited
support for unsigned types beyond uint8. The torch.uint64 dtype exists
but is not exposed in the standard torch namespace
(see pytorch/pytorch#58734).

PyTorch 2.4.0 properly exposes torch.uint64 in the public API, resolving
the compatibility issue with safetensors. This also required torchvision
to be updated to 0.19.0 for compatibility.

Refs: https://huggingface.co/spaces/ggml-org/gguf-my-repo/discussions/186#68938de803e47d990aa087fb
Refs: pytorch/pytorch#58734
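
The failure mode reduces to a missing public attribute, so the requirement bump can be sanity-checked with a small guard; a hedged sketch (not the actual convert_hf_to_gguf code):

```python
import torch

# PyTorch 2.2.x defines the uint64 dtype internally but does not expose it
# in the public torch namespace, while safetensors expects torch.uint64 to
# exist. PyTorch >= 2.4.0 exposes it, which is why bumping the requirement
# (together with a matching torchvision) resolves the AttributeError.
if not hasattr(torch, "uint64"):
    raise RuntimeError(
        f"PyTorch {torch.__version__} does not expose torch.uint64; "
        "upgrade to PyTorch >= 2.4.0 and a matching torchvision."
    )
```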
* CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16
Any available libraries are found and loaded dynamically at runtime.
* support internvl

* support interns1

* resolve comments

* put interns1 in tensor mapping

* resolve comment

* move tokenizer changes to sub class
* convert : support non-mxfp4 HF model

* rm redundant check

* disable debug check
* vendor: sync minja

* Update minja.hpp

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
* server-bench: external OAI servers, sqlite

* Update scripts/server-bench.py

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Update scripts/server-bench.py

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Update scripts/server-bench.py

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* raise_for_status

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
* gguf-py : add MXFP4 de/quantization support

* ggml-quants : handle zero amax for MXFP4
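
The zero-amax case matters because MXFP4 derives a shared power-of-two scale for each block from the block's absolute maximum, and log2(0) is undefined. A simplified, illustrative Python sketch of the guard (not the actual gguf-py implementation; the E2M1 rounding and nibble packing are omitted):

```python
import numpy as np

BLOCK_SIZE = 32  # MXFP4 quantizes values in blocks that share one scale


def mxfp4_block_scale_sketch(block: np.ndarray) -> float:
    """Pick the shared block scale, guarding against an all-zero block."""
    amax = float(np.abs(block).max())
    if amax == 0.0:
        # No meaningful exponent can be derived from an all-zero block:
        # log2(0) would give -inf. Fall back to a scale of 1.0 so every
        # element simply quantizes to zero.
        return 1.0
    # Shared power-of-two scale derived from the block's absolute maximum.
    return float(2.0 ** np.floor(np.log2(amax)))


# Example: an all-zero block gets a well-defined scale instead of -inf/NaN.
assert mxfp4_block_scale_sketch(np.zeros(BLOCK_SIZE, dtype=np.float32)) == 1.0
```
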
* CUDA: add attention sinks for tile and wmma

* Review: formatting changes + remove syncthreads from tile + remove warp_reduce_max from wmma
* cuda: refactored ssm_scan to use CUB

* fixed compilation error when not using CUB

* assign L to constant and use size_t instead of int

* deduplicated functions

* change min blocks per mp to 1

* Use cub load and store warp transpose

* suppress clang warning
* kleidiai: fix unsigned overflow bug

* address review comments
* Improve Mistral models integration with llama.cpp

* Revert changes and fix gguf

* Revert change

* refactor convert_mistral_to_gguf.py in convert_hf_to_gguf.py

* Revert collateral

* Rename model name

* refactor

* revert

* remove duplicate

* Remove duplication code

* Fixes

* Fix flake issues

* Apply comments

* Apply comments

* Apply comments

* Fix remote

* add default chat template

* Revert

* nit
This commit updates comments and error messages to use "decode" instead
of "eval" in perplexity.cpp.

The motivation for this is that `llama_eval` was renamed to
`llama_decode` a while ago, but the comments and error messages
still referred to "eval". This change ensures consistency and clarity.
ggerganov and others added 28 commits August 19, 2025 08:45
* chat : clarify the meaning of reasoning_format

* add link to this PR
…5385)

* Added VSX intrinsics for Power9+ systems

Signed-off-by: mgiessing <[email protected]>

* Manual unrolling for minor perf improvement

Signed-off-by: mgiessing <[email protected]>

* Update ggml/src/ggml-cpu/arch/powerpc/quants.c

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Signed-off-by: mgiessing <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
* optimize rope ops

* amendment

* delete trailing whitespace

* change the variable name
* server : disable context shift by default

ggml-ci

* server : make scope of test parameters local
* musa: fix build warnings

Signed-off-by: Xiaodong Ye <[email protected]>

* fix warning: comparison of integers of different signs: 'const int' and 'unsigned int' [-Wsign-compare]

Signed-off-by: Xiaodong Ye <[email protected]>

---------

Signed-off-by: Xiaodong Ye <[email protected]>
* lookahead : add sample command to readme

* cont : build-agnostic command
This commit removes the content from the Makefile and updates the
current deprecation message to information that `make` has been
replaced by CMake instead.

The message when `make` is invoked will now be the following:
```console
$ make
Makefile:6: *** Build system changed:
 The Makefile build has been replaced by CMake.

 For build instructions see:
 https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md

.  Stop.
```

The motivation for this is that many, if not all, targets now fail to
build after the changes to the build system, and `make` has also been
deprecated for some time.
* Update docker.yml

Modify docker.yml so that the workflow no longer runs on a schedule; if you want to run the workflow, it can be triggered manually.

* feat: Modify the header file include path

1. There is no llava directory in the tools directory.
2. Because `target_include_directories(mtmd PUBLIC .)` is used in the `mtmd` CMakeLists.txt file, other targets that link against `mtmd` automatically get the `mtmd` directory as a header search path. Therefore, `target_include_directories(${TARGET} PRIVATE ../llava)` can be removed, or replaced with `target_include_directories(${TARGET} PRIVATE ../mtmd)` to explicitly require the `llama-server` target to use header files from `mtmd`.

* Restore the docker.yml file
This commit addresses an inconsistency during inference by adding a new
member to the `templates_params` struct to indicate whether the chat is
in inference mode. This allows the gpt-oss specific function
`common_chat_params_init_gpt_oss` to check this flag and the
`add_generation_prompt` flag to determine if it should replace the
`<|return|>` token with the `<|end|>` token in the prompt.

The motivation for this change is to ensure that the formatted prompt of
past messages in `common_chat_format_single` matches the output of the
formatted new message. The issue is that the gpt-oss template returns
different end tags: `<|return|>` when `add_generation_prompt` is false,
and `<|end|>` when `add_generation_prompt` is true. This causes the
substring function to start at an incorrect position, resulting in
tokenization starting with 'tart|>' instead of '<|start|>'.

Resolves: ggml-org/llama.cpp#15417
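
A Python illustration of the end-tag normalization described above (the actual change is in the C++ `common_chat_params_init_gpt_oss`; the function and flag names here are hypothetical):

```python
RETURN_TAG = "<|return|>"
END_TAG = "<|end|>"


def normalize_gpt_oss_end_tag(rendered: str, is_inference: bool,
                              add_generation_prompt: bool) -> str:
    """Make re-rendered past messages end with <|end|> like new ones do.

    The gpt-oss template ends an assistant turn with <|return|> when
    add_generation_prompt is false and with <|end|> when it is true. When
    only re-formatting past messages (not running inference), rewrite a
    trailing <|return|> to <|end|> so the substring comparison in
    common_chat_format_single starts at the correct position.
    """
    if (not is_inference and not add_generation_prompt
            and rendered.endswith(RETURN_TAG)):
        return rendered[: -len(RETURN_TAG)] + END_TAG
    return rendered
```
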
These detailed strings were causing increased build time on gcc.
feat(common): use llama internal log
vinovo closed this Sep 7, 2025
vinovo deleted the tmp/paul/new branch September 7, 2025 18:18