Tmp/paul/new #18
Closed
Conversation
* Add parameter buffer pool, batching of submissions, refactor command building/submission (a buffer-pool sketch follows this list)
* Add header for Linux builds
* Free staged parameter buffers at once
* Format with clang-format
* Fix thread-safe implementation
* Use device implicit synchronization
* Update workflow to use custom release
* Remove testing branch workflow
* Disable set_rows until it's implemented
* Fix potential issue around empty queue submission
* Try synchronous submission
* Try waiting on all futures explicitly
* Add debug
* Add more debug messages
* Work on getting SSH access for debugging
* Debug on failure
* Disable other tests
* Remove extra if
* Try more locking
* maybe passes?
* test
* Some cleanups
* Restore build file
* Remove extra testing branch CI
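The first bullet above mentions a parameter buffer pool. As a rough illustration of that pattern only (the buffer type, sizing, and all WebGPU-specific details here are placeholders, not the actual backend code), a pool that reuses released buffers instead of reallocating might look like this:

```cpp
// Minimal buffer-pool sketch: reuse released buffers instead of reallocating.
// Buffer contents and sizes are placeholders; this is not the backend's code.
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

struct Buffer {
    std::vector<unsigned char> data;
    explicit Buffer(size_t size) : data(size) {}
};

class BufferPool {
public:
    explicit BufferPool(size_t buf_size) : buf_size_(buf_size) {}

    // Hand out a free buffer if one exists, otherwise allocate a new one.
    Buffer * acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!free_.empty()) {
            Buffer * b = free_.back();
            free_.pop_back();
            return b;
        }
        owned_.push_back(std::unique_ptr<Buffer>(new Buffer(buf_size_)));
        return owned_.back().get();
    }

    // Return a buffer once the submission that used it has completed.
    void release(Buffer * b) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_.push_back(b);
    }

private:
    size_t buf_size_;
    std::mutex mutex_;
    std::vector<Buffer *> free_;                  // buffers ready for reuse
    std::vector<std::unique_ptr<Buffer>> owned_;  // all buffers ever allocated
};

int main() {
    BufferPool pool(256);
    Buffer * a = pool.acquire();
    pool.release(a);
    Buffer * b = pool.acquire(); // reuses the buffer released above
    return a == b ? 0 : 1;
}
```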
* feat(cann): add optional support for ACL Graph execution

  This commit adds support for executing ggml computational graphs using Huawei's ACL graph mode via the USE_CANN_GRAPH flag. The support can be enabled at compile time using the CMake option `-DUSE_CANN_GRAPH=ON`. By default, ACL graph execution is **disabled**, and the fallback path uses node-by-node execution.

  Key additions:
  - CMake option to toggle graph mode
  - Graph capture and execution logic
  - Tensor property matching to determine whether a graph update is required
  - Safe fallback and logging if the environment variable LLAMA_SET_ROWS is unset or invalid (sketched after this message)

  This prepares the backend for performance improvements in repetitive graph execution scenarios on Ascend devices.

  Signed-off-by: noemotiovon <[email protected]>

* Fix review comments

  Signed-off-by: noemotiovon <[email protected]>

* Rename USE_CANN_GRAPH to USE_ACL_GRAPH

  Signed-off-by: noemotiovon <[email protected]>

* Fix typo

  Signed-off-by: noemotiovon <[email protected]>

---------

Signed-off-by: noemotiovon <[email protected]>
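The "safe fallback and logging" item above describes checking an environment variable and reverting to node-by-node execution when it is unset or invalid. A minimal sketch of that pattern, assuming the flag is read with `getenv`; the accepted values and messages are illustrative, not the actual CANN backend code:

```cpp
// Illustrative only: read a boolean-style environment variable and fall back
// safely when it is unset or has an unrecognized value.
#include <cstdio>
#include <cstdlib>
#include <cstring>

static bool env_flag_enabled(const char * name) {
    const char * val = std::getenv(name);
    if (val == nullptr) {
        std::fprintf(stderr, "%s is unset, falling back to node-by-node execution\n", name);
        return false;
    }
    if (std::strcmp(val, "1") == 0 || std::strcmp(val, "on") == 0) {
        return true;
    }
    std::fprintf(stderr, "%s has unrecognized value '%s', falling back\n", name, val);
    return false;
}

int main() {
    const bool use_graph = env_flag_enabled("LLAMA_SET_ROWS");
    std::printf("graph execution %s\n", use_graph ? "enabled" : "disabled");
    return 0;
}
```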
Signed-off-by: stevenkuang <[email protected]>
* opencl: add `swiglu-oai` * opencl: add `add_id` * opencl: add missing `add_id.cl`
* Begin work on set_rows * Work on set rows * Add error buffers for reporting unsupported SET_ROWS indices * Remove extra comments
This commit addresses an issue with the convert_hf_to_gguf script, which is currently failing with:

```console
AttributeError: module 'torch' has no attribute 'uint64'
```

This occurred because safetensors expects torch.uint64 to be available in the public API, but PyTorch 2.2.x appears to provide only limited support for unsigned types beyond uint8. The torch.uint64 dtype exists but is not exposed in the standard torch namespace (see pytorch/pytorch#58734). PyTorch 2.4.0 properly exposes torch.uint64 in the public API, resolving the compatibility issue with safetensors. This also required torchvision to be updated to 0.19.0 for compatibility.

Refs: https://huggingface.co/spaces/ggml-org/gguf-my-repo/discussions/186#68938de803e47d990aa087fb
Refs: pytorch/pytorch#58734
* CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16
Any available libraries are found and loaded dynamically at runtime.
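For context, "found and loaded dynamically at runtime" generally means probing for optional shared libraries with the platform loader instead of linking them at build time. A generic POSIX sketch of that pattern follows; the library names and the `foo_init` symbol are hypothetical, and this is not the project's actual loading code:

```cpp
// Illustrative only: probe a list of candidate shared-library names with
// dlopen and resolve a symbol if one of them is present.
#include <cstdio>
#include <dlfcn.h>

typedef int (*init_fn_t)(void); // hypothetical entry point in the library

static void * try_load(const char * const * names, int n) {
    for (int i = 0; i < n; ++i) {
        void * handle = dlopen(names[i], RTLD_LAZY | RTLD_LOCAL);
        if (handle != nullptr) {
            std::printf("loaded %s\n", names[i]);
            return handle;
        }
    }
    return nullptr; // none found; caller continues without the library
}

int main() {
    const char * names[] = { "libfoo.so.1", "libfoo.so" }; // hypothetical names
    void * handle = try_load(names, 2);
    if (handle == nullptr) {
        std::printf("optional library not available, continuing without it\n");
        return 0;
    }
    init_fn_t init = reinterpret_cast<init_fn_t>(dlsym(handle, "foo_init"));
    if (init != nullptr) {
        init();
    }
    dlclose(handle);
    return 0;
}
```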
…age metrics (#15103)
* support internvl * support interns1 * resolve comments * put interns1 in tensor mapping * resolve comment * move tokenizer changes to sub class
* convert : support non-mxfp4 HF model * rm redundant check * disable debug check
* vendor: sync minja * Update minja.hpp * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <[email protected]> --------- Co-authored-by: Sigbjørn Skjæret <[email protected]>
* server-bench: external OAI servers, sqlite * Update scripts/server-bench.py Co-authored-by: Sigbjørn Skjæret <[email protected]> * Update scripts/server-bench.py Co-authored-by: Sigbjørn Skjæret <[email protected]> * Update scripts/server-bench.py Co-authored-by: Sigbjørn Skjæret <[email protected]> * raise_for_status --------- Co-authored-by: Sigbjørn Skjæret <[email protected]>
* gguf-py : add MXFP4 de/quantization support * ggml-quants : handle zero amax for MXFP4
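The second bullet above is about handling a block whose maximum absolute value (amax) is zero. A minimal sketch of that guard, using a placeholder block layout rather than the real MXFP4 format (the block size, scale encoding, and 4-bit-style range are assumptions for illustration):

```cpp
// Illustrative only: guard a block quantizer against an all-zero block so the
// scale computation never divides by zero. Not the real MXFP4 layout.
#include <cmath>
#include <cstdio>

constexpr int kBlockSize = 32;

static void quantize_block(const float * x, float * scale_out, signed char * q_out) {
    float amax = 0.0f;
    for (int i = 0; i < kBlockSize; ++i) {
        amax = std::fmax(amax, std::fabs(x[i]));
    }
    if (amax == 0.0f) {
        // all-zero block: emit a zero scale and zero quants instead of
        // computing 1/scale below
        *scale_out = 0.0f;
        for (int i = 0; i < kBlockSize; ++i) q_out[i] = 0;
        return;
    }
    const float scale     = amax / 7.0f;   // placeholder 4-bit-style range
    const float inv_scale = 1.0f / scale;
    for (int i = 0; i < kBlockSize; ++i) {
        float v = x[i] * inv_scale;
        v = std::fmin(std::fmax(v, -7.0f), 7.0f);
        q_out[i] = static_cast<signed char>(std::lround(v));
    }
    *scale_out = scale;
}

int main() {
    float zeros[kBlockSize] = {0};
    float scale = -1.0f;
    signed char q[kBlockSize];
    quantize_block(zeros, &scale, q);
    std::printf("scale for all-zero block: %g\n", scale);
    return 0;
}
```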
* CUDA: add attention sinks for tile and wmma * Review: formatting changes + remove syncthreads from tile + remove warp_reduce_max from wmma
* cuda: refactored ssm_scan to use CUB * fixed compilation error when not using CUB * assign L to constant and use size_t instead of int * deduplicated functions * change min blocks per mp to 1 * Use cub load and store warp transpose * suppress clang warning
* kleidiai: fix unsigned overflow bug * address review comments
* Improve Mistral models integration with llama.cpp
* Revert changes and fix gguf
* Revert change
* refactor convert_mistral_to_gguf.py into convert_hf_to_gguf.py
* Revert collateral
* Rename model name
* refactor
* revert
* remove duplicate
* Remove duplicated code
* Fixes
* Fix flake issues
* Apply comments
* Apply comments
* Apply comments
* Fix remote
* add default chat template
* Revert
* nit
This commit updates comments and error messages to use "decode" instead of "eval" in perplexity.cpp. The motivation for this is that `llama_eval` was renamed to `llama_decode` a while ago, but the comments and error messages still referred to "eval". This change ensures consistency and clarity.
* chat : clarify the meaning of reasoning_format * add link to this PR
…5385) * Added VSX intrinsics for Power9+ systems Signed-off-by: mgiessing <[email protected]> * Manual unrolling for minor perf improvement Signed-off-by: mgiessing <[email protected]> * Update ggml/src/ggml-cpu/arch/powerpc/quants.c Co-authored-by: Georgi Gerganov <[email protected]> --------- Signed-off-by: mgiessing <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]>
…15413) Signed-off-by: Xiaodong Ye <[email protected]>
* optimize rope ops * amendment * delete trailing whitespace * change the variable name
* server : disable context shift by default ggml-ci * server : make scope of test parameters local
* musa: fix build warnings Signed-off-by: Xiaodong Ye <[email protected]> * fix warning: comparison of integers of different signs: 'const int' and 'unsigned int' [-Wsign-compare] Signed-off-by: Xiaodong Ye <[email protected]> --------- Signed-off-by: Xiaodong Ye <[email protected]>
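The second item above quotes a `-Wsign-compare` warning. A small generic example of the kind of comparison that triggers it and one common fix (the variable names are not from the MUSA code):

```cpp
// Illustrative only: comparing a signed int with an unsigned int triggers
// -Wsign-compare; casting one operand makes the signedness explicit.
#include <cstdio>

int main() {
    const int      n     = 8; // signed value, e.g. a loop bound
    const unsigned count = 8; // unsigned value, e.g. from another API

    // `if (n < count)` would warn: comparison of integers of different signs:
    // 'const int' and 'unsigned int' [-Wsign-compare]
    if (static_cast<unsigned>(n) < count) {
        std::printf("n is smaller\n");
    } else {
        std::printf("n is not smaller\n");
    }
    return 0;
}
```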
* lookahead : add sample command to readme * cont : build-agnostic command
This commit removes the content from the Makefile and updates the deprecation message to state that the `make` build has been replaced by CMake. The message when `make` is invoked will now be the following:

```console
$ make
Makefile:6: *** Build system changed: The Makefile build has been replaced by CMake. For build instructions see: https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md . Stop.
```

The motivation for this is that many, if not all, targets now fail to build after changes to the build system, and `make` has also been deprecated for some time.
* Update docker.yml

  Modify docker.yml so that the workflow no longer runs periodically; if you want to run the workflow, you can start it manually.

* feat: Modify the header file include path

  1. There is no llava directory in the tools directory.
  2. Because the command `target_include_directories(mtmd PUBLIC .)` is used in the `mtmd` CMakeLists.txt file, other targets that link against `mtmd` automatically include the `mtmd` directory as a header search path. Therefore, you can remove `target_include_directories(${TARGET} PRIVATE ../llava)` or use `target_include_directories(${TARGET} PRIVATE ../mtmd)` to explicitly require the `llama-server` target to use header files from `mtmd`.

* Restore the docker.yml file
Signed-off-by: Jie Fu <[email protected]>
This commit addresses an inconsistency during inference by adding a new member to the `templates_params` struct to indicate whether the chat is in inference mode. This allows the gpt-oss specific function `common_chat_params_init_gpt_oss` to check this flag and the `add_generation_prompt` flag to determine if it should replace the `<|return|>` token with the `<|end|>` token in the prompt. The motivation for this change is to ensure that the formatted prompt of past messages in `common_chat_format_single` matches the output of the formatted new message. The issue is that the gpt-oss template returns different end tags: `<|return|>` when `add_generation_prompt` is false, and `<|end|>` when `add_generation_prompt` is true. This causes the substring function to start at an incorrect position, resulting in tokenization starting with 'tart|>' instead of '<|start|>'. Resolves: ggml-org/llama.cpp#15417
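The mismatch described above (tokenization starting with 'tart|>' instead of '<|start|>') comes from the past-messages prompt ending with a longer tag than the full prompt uses at the same position, which shifts the substring offset. A small standalone sketch that reproduces the offset problem and the normalization fix; the strings and helper are illustrative, not the actual `common_chat_format_single` code:

```cpp
// Illustrative only: why a mismatched end tag shifts the substring offset, and
// how replacing "<|return|>" with "<|end|>" realigns past and new messages.
#include <cstdio>
#include <string>

static void replace_end_tag(std::string & s) {
    const std::string from = "<|return|>";
    const std::string to   = "<|end|>";
    const size_t pos = s.rfind(from);
    if (pos != std::string::npos) {
        s.replace(pos, from.size(), to);
    }
}

int main() {
    // formatted prompt of the past messages ends with <|return|> ...
    std::string past = "<|start|>assistant<|message|>hi<|return|>";
    // ... while the same messages plus a generation prompt use <|end|> there
    std::string full = "<|start|>assistant<|message|>hi<|end|><|start|>assistant";

    // Without normalization the prefixes differ in length, so taking the new
    // part as full.substr(past.size()) starts mid-token ("tart|>...").
    std::printf("misaligned: %s\n", full.substr(past.size()).c_str());

    replace_end_tag(past);
    std::printf("aligned:    %s\n", full.substr(past.size()).c_str());
    return 0;
}
```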
These detailed strings were causing increased build time on gcc.
Remilia/chore/update upstream
This reverts commit 04a2630.
feat(common): use llama internal log