
Conversation


vinovo commented Sep 5, 2025

Make sure to read the contributing guidelines before submitting a PR

reeselevine and others added 30 commits August 5, 2025 16:26
* Add parameter buffer pool, batching of submissions, refactor command building/submission

* Add header for linux builds

* Free staged parameter buffers at once

* Format with clang-format

* Fix thread-safe implementation

* Use device implicit synchronization

* Update workflow to use custom release

* Remove testing branch workflow

* Disable set_rows until it's implemented

* Fix potential issue around empty queue submission

* Try synchronous submission

* Try waiting on all futures explicitly

* Add debug

* Add more debug messages

* Work on getting ssh access for debugging

* Debug on failure

* Disable other tests

* Remove extra if

* Try more locking

* maybe passes?

* test

* Some cleanups

* Restore build file

* Remove extra testing branch ci
* feat(cann): add optional support for ACL Graph execution

This commit adds support for executing ggml computational graphs using
Huawei's ACL graph mode via the USE_CANN_GRAPH flag. The support can be
enabled at compile time using the CMake option:

    -DUSE_CANN_GRAPH=ON

By default, ACL graph execution is **disabled**, and the fallback path
uses node-by-node execution.

Key additions:
- CMake option `USE_CANN_GRAPH` to toggle graph mode
- Graph capture and execution logic using the ACL graph API
- Tensor property matching to determine whether a graph update is required
  (see the sketch below)
- Safe fallback and logging if the environment variable LLAMA_SET_ROWS
  is unset or invalid

This prepares the backend for performance improvements in repetitive graph
execution scenarios on Ascend devices.

Signed-off-by: noemotiovon <[email protected]>
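
The "tensor property matching" mentioned above is essentially a cache keyed on node properties: re-capture the ACL graph only when an operator, shape, or dtype changes, and otherwise replay the previously captured graph. A minimal Python sketch of that idea (illustrative only; the real logic lives in the CANN backend's C++ code and works on ggml tensor metadata):

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class NodeSignature:
    """Tensor properties that must match for a captured graph to be reused."""
    op: str
    shape: tuple
    dtype: str


def graph_signature(nodes: Sequence[dict]) -> tuple:
    """Build an immutable signature for the whole computational graph."""
    return tuple(
        NodeSignature(n["op"], tuple(n["shape"]), n["dtype"]) for n in nodes
    )


class CapturedGraphCache:
    """Re-capture the graph only when the node signature changes."""

    def __init__(self) -> None:
        self._signature: tuple | None = None
        self._graph = None

    def get_or_capture(self, nodes: Sequence[dict], capture: Callable):
        sig = graph_signature(nodes)
        if sig != self._signature:
            self._graph = capture(nodes)   # expensive: capture and compile
            self._signature = sig
        return self._graph                 # cheap: replay the captured graph
```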

* Fix review comments

Signed-off-by: noemotiovon <[email protected]>

* rename USE_CANN_GRAPH to USE_ACL_GRAPH

Signed-off-by: noemotiovon <[email protected]>

* fix typo

Signed-off-by: noemotiovon <[email protected]>

---------

Signed-off-by: noemotiovon <[email protected]>
* opencl: add `swiglu-oai`

* opencl: add `add_id`

* opencl: add missing `add_id.cl`
* Begin work on set_rows

* Work on set rows

* Add error buffers for reporting unsupported SET_ROWS indices

* Remove extra comments
This commit addresses an issue with the convert_hf_to_gguf script
which is currently failing with:
```console
AttributeError: module 'torch' has no attribute 'uint64'
```

This occurred because safetensors expects torch.uint64 to be available
in the public API, but PyTorch 2.2.x appears to provide only limited
support for unsigned types beyond uint8. The torch.uint64 dtype exists
but is not exposed in the standard torch namespace
(see pytorch/pytorch#58734).

PyTorch 2.4.0 properly exposes torch.uint64 in the public API, resolving
the compatibility issue with safetensors. This also required torchvision
to be updated to 0.19.0 for compatibility.

Refs: https://huggingface.co/spaces/ggml-org/gguf-my-repo/discussions/186#68938de803e47d990aa087fb
Refs: pytorch/pytorch#58734
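
The failure mode reduces to a missing public attribute, so the requirement bump can be sanity-checked with a small guard; a hedged sketch (not the actual convert_hf_to_gguf code):

```python
import torch

# PyTorch 2.2.x defines the uint64 dtype internally but does not expose it
# in the public torch namespace, while safetensors expects torch.uint64 to
# exist. PyTorch >= 2.4.0 exposes it, which is why bumping the requirement
# (together with a matching torchvision) resolves the AttributeError.
if not hasattr(torch, "uint64"):
    raise RuntimeError(
        f"PyTorch {torch.__version__} does not expose torch.uint64; "
        "upgrade to PyTorch >= 2.4.0 and a matching torchvision."
    )
```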
* CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16
Any available libraries are found and loaded dynamically at runtime.
* support internvl

* support interns1

* resolve comments

* put interns1 in tensor mapping

* resolve comment

* move tokenizer changes to sub class
* convert : support non-mxfp4 HF model

* rm redundant check

* disable debug check
* vendor: sync minja

* Update minja.hpp

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
* server-bench: external OAI servers, sqlite

* Update scripts/server-bench.py

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Update scripts/server-bench.py

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Update scripts/server-bench.py

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* raise_for_status

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
* gguf-py : add MXFP4 de/quantization support

* ggml-quants : handle zero amax for MXFP4
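
The zero-amax case matters because MXFP4 derives a shared power-of-two scale for each block from the block's absolute maximum, and log2(0) is undefined. A simplified, illustrative Python sketch of the guard (not the actual gguf-py implementation; the E2M1 rounding and nibble packing are omitted):

```python
import numpy as np

BLOCK_SIZE = 32  # MXFP4 quantizes values in blocks that share one scale


def mxfp4_block_scale_sketch(block: np.ndarray) -> float:
    """Pick the shared block scale, guarding against an all-zero block."""
    amax = float(np.abs(block).max())
    if amax == 0.0:
        # No meaningful exponent can be derived from an all-zero block:
        # log2(0) would give -inf. Fall back to a scale of 1.0 so every
        # element simply quantizes to zero.
        return 1.0
    # Shared power-of-two scale derived from the block's absolute maximum.
    return float(2.0 ** np.floor(np.log2(amax)))


# Example: an all-zero block gets a well-defined scale instead of -inf/NaN.
assert mxfp4_block_scale_sketch(np.zeros(BLOCK_SIZE, dtype=np.float32)) == 1.0
```
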
* CUDA: add attention sinks for tile and wmma

* Review: formatting changes + remove syncthreads from tile + remove warp_reduce_max from wmma
* cuda: refactored ssm_scan to use CUB

* fixed compilation error when not using CUB

* assign L to constant and use size_t instead of int

* deduplicated functions

* change min blocks per mp to 1

* Use cub load and store warp transpose

* suppress clang warning
* kleidiai: fix unsigned overflow bug

* address review comments
* Improve Mistral models integration with llama.cpp

* Revert changes and fix gguf

* Revert change

* refactor convert_mistral_to_gguf.py in convert_hf_to_gguf.py

* Revert collateral

* Rename model name

* refactor

* revert

* remove duplicate

* Remove duplication code

* Fixes

* Fix flake issues

* Apply comments

* Apply comments

* Apply comments

* Fix remote

* add default chat template

* Revert

* nit
This commit updates comments and error messages to use "decode" instead
of "eval" in perplexity.cpp.

The motivation for this is that `llama_eval` was renamed to
`llama_decode` a while ago, but the comments and error messages
still referred to "eval". This change ensures consistency and clarity.
ggerganov and others added 28 commits August 19, 2025 08:45
* chat : clarify the meaning of reasoning_format

* add link to this PR
…5385)

* Added VSX intrinsics for Power9+ systems

Signed-off-by: mgiessing <[email protected]>

* Manual unrolling for minor perf improvement

Signed-off-by: mgiessing <[email protected]>

* Update ggml/src/ggml-cpu/arch/powerpc/quants.c

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Signed-off-by: mgiessing <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
* optimize rope ops

* amendment

* delete trailing whitespace

* change the variable name
* server : disable context shift by default

ggml-ci

* server : make scope of test parameters local
* musa: fix build warnings

Signed-off-by: Xiaodong Ye <[email protected]>

* fix warning: comparison of integers of different signs: 'const int' and 'unsigned int' [-Wsign-compare]

Signed-off-by: Xiaodong Ye <[email protected]>

---------

Signed-off-by: Xiaodong Ye <[email protected]>
* lookahead : add sample command to readme

* cont : build-agnostic command
This commit removes the content from the Makefile and updates the
current deprecation message to information that `make` has been
replaced by CMake instead.

The message when `make` is invoked will now be the following:
```console
$ make
Makefile:6: *** Build system changed:
 The Makefile build has been replaced by CMake.

 For build instructions see:
 https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md

.  Stop.
```

The motivation for this is that many, if not all, targets now fail to
build after the changes to the build system, and `make` has also been
deprecated for some time.
* Update docker.yml

Modify docker.yml so that the workflow no longer runs on a schedule; if you want to run the workflow, it can be triggered manually.

* feat: Modify the header file include path

1. There is no llava directory in the tools directory.
2. Because `target_include_directories(mtmd PUBLIC .)` is used in the `mtmd` CMakeLists.txt file, other targets that link against `mtmd` automatically get the `mtmd` directory as a header search path. Therefore, `target_include_directories(${TARGET} PRIVATE ../llava)` can be removed, or replaced with `target_include_directories(${TARGET} PRIVATE ../mtmd)` to explicitly require the `llama-server` target to use header files from `mtmd`.

* Restore the docker.yml file
This commit addresses an inconsistency during inference by adding a new
member to the `templates_params` struct to indicate whether the chat is
in inference mode. This allows the gpt-oss specific function
`common_chat_params_init_gpt_oss` to check this flag and the
`add_generation_prompt` flag to determine if it should replace the
`<|return|>` token with the `<|end|>` token in the prompt.

The motivation for this change is to ensure that the formatted prompt of
past messages in `common_chat_format_single` matches the output of the
formatted new message. The issue is that the gpt-oss template returns
different end tags: `<|return|>` when `add_generation_prompt` is false,
and `<|end|>` when `add_generation_prompt` is true. This causes the
substring function to start at an incorrect position, resulting in
tokenization starting with 'tart|>' instead of '<|start|>'.

Resolves: ggml-org/llama.cpp#15417
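
A Python illustration of the end-tag normalization described above (the actual change is in the C++ `common_chat_params_init_gpt_oss`; the function and flag names here are hypothetical):

```python
RETURN_TAG = "<|return|>"
END_TAG = "<|end|>"


def normalize_gpt_oss_end_tag(rendered: str, is_inference: bool,
                              add_generation_prompt: bool) -> str:
    """Make re-rendered past messages end with <|end|> like new ones do.

    The gpt-oss template ends an assistant turn with <|return|> when
    add_generation_prompt is false and with <|end|> when it is true. When
    only re-formatting past messages (not running inference), rewrite a
    trailing <|return|> to <|end|> so the substring comparison in
    common_chat_format_single starts at the correct position.
    """
    if (not is_inference and not add_generation_prompt
            and rendered.endswith(RETURN_TAG)):
        return rendered[: -len(RETURN_TAG)] + END_TAG
    return rendered
```
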
These detailed strings were causing increased build time on gcc.
feat(common): use llama internal log
vinovo closed this Sep 7, 2025
vinovo deleted the tmp/paul/new branch September 7, 2025 18:18