Releases · YannFollet/llama.cpp

14 Oct 06:40

bc07349

b6756 Latest

Latest

server : dynamic token limit for prompt cache (#16560)

* server : dynamic token limit for prompt cache

* cont : print estimated token limit

Assets 15

cudart-llama-bin-win-cuda-12.4-x64.zip

sha256:8c79a9b226de4b3cacfd1f83d24f962d0773be79f1e7b75c6af4ded7e32ae1d6

373 MB 2025-10-14T06:40:17Z
llama-b6756-bin-macos-arm64.zip

sha256:704f1d97aec7a5529f164f75a7bba39390305535d5cec1a1a9733a56852a01c0

10.4 MB 2025-10-14T06:40:27Z
llama-b6756-bin-macos-x64.zip

sha256:3133d48eecbd1c3fd8f2084d4e1a6da5d23ab7fbf9876f64f9a2c56689b2161f

27 MB 2025-10-14T06:40:28Z
llama-b6756-bin-ubuntu-vulkan-x64.zip

sha256:5971f3fd6dc8feb113b96a371c6f414d97cc3a94d6e2617afbc4cdf08940538d

25.6 MB 2025-10-14T06:40:30Z
llama-b6756-bin-ubuntu-x64.zip

sha256:b22445c3fc43cee6286cbfdb3b773ccd923bc8b4c583e97bfa63df5d073a684f

12.5 MB 2025-10-14T06:40:31Z
llama-b6756-bin-win-cpu-arm64.zip

sha256:002f58ca9d82dc38f6520507748b460a7d8372feac45c4a83d075e13efed32bf

10.6 MB 2025-10-14T06:40:32Z
llama-b6756-bin-win-cpu-x64.zip

sha256:578327bcb721c5b7bddfeb4efc22bce29e67cb76fc5a47b107e02955915b1f4c

13.6 MB 2025-10-14T06:40:33Z
llama-b6756-bin-win-cuda-12.4-x64.zip

sha256:5bbd737db352a34ca74ddd8c1b297b4769ef428205b1a6c07994d14ae9ee2da2

161 MB 2025-10-14T06:40:34Z
llama-b6756-bin-win-hip-radeon-x64.zip

sha256:8b5fcea5402f8f24e7346cc18852bbecb20021135e50fc5f4c038449a9704222

321 MB 2025-10-14T06:40:39Z
llama-b6756-bin-win-opencl-adreno-arm64.zip

sha256:c57cfebddd1e8a74be7b99a957a377873420c399566c5e3edfedf06f02c21a5b

11 MB 2025-10-14T06:40:48Z
Source code (zip)

2025-10-14T05:48:50Z
Source code (tar.gz)

2025-10-14T05:48:50Z

12 Oct 01:22

github-actions

b6736

11f0af5

b6736

CUDA: faster tile FA, add oob checks, more HSs (#16492)

Assets 15

09 Oct 02:27

github-actions

b6715

12bbc3f

b6715

refactor: centralize CoT parsing in backend for streaming mode (#16394)

* refactor: unify reasoning handling via backend reasoning_content, drop frontend tag parsing

- Updated the chat message component to surface backend-supplied reasoning via message.thinking while showing the raw assistant content without inline tag scrubbing
- Simplified chat streaming to append content chunks directly, stream reasoning into the message model, and persist any partial reasoning when generation stops
- Refactored the chat service SSE handler to rely on server-provided reasoning_content, removing legacy <think> parsing logic
- Refreshed Storybook data and streaming flows to populate the thinking field explicitly for static and streaming assistant messages

* refactor: implement streaming-aware universal reasoning parser

Remove the streaming mode limitation from --reasoning-format by refactoring
try_parse_reasoning() to handle incremental parsing of <think> tags across
all formats.

- Rework try_parse_reasoning() to track whitespace, partial tags, and
  multiple reasoning segments, allowing proper separation of reasoning_content
  and content in streaming mode
- Parse reasoning tags before tool call handling in content-only and Llama 3.x
  formats to ensure inline <think> blocks are captured correctly
- Change default reasoning_format from 'auto' to 'deepseek' for consistent
  behavior
- Add 'deepseek-legacy' option to preserve old inline behavior when needed
- Update CLI help and documentation to reflect streaming support
- Add parser tests for inline <think>...</think> segments

The parser now continues processing content after </think> closes instead of
stopping, enabling proper message.reasoning_content and message.content
separation in both streaming and non-streaming modes.

Fixes the issue where streaming responses would dump everything (including
post-thinking content) into reasoning_content while leaving content empty.

* refactor: address review feedback from allozaur

- Passed the assistant message content directly to ChatMessageAssistant to drop the redundant derived state in the chat message component
- Simplified chat streaming updates by removing unused partial-thinking handling and persisting partial responses straight from currentResponse
- Refreshed the ChatMessage stories to cover standard and reasoning scenarios without the old THINK-tag parsing examples

Co-authored-by: Aleksander Grygier <[email protected]>

* refactor: restore forced reasoning prefix to pass test-chat ([chat] All tests passed)

- store the exact sequence seen on input when 'thinking_forced_open' enforces a reasoning block
- inject this prefix before the first accumulated segment in 'reasoning_content', then clear it to avoid duplication
- repeat the capture on every new 'start_think' detection to properly handle partial/streaming flows

* refactor: address review feedback from ngxson

* debug: say goodbye to curl -N, hello one-click raw stream

- adds a new checkbox in the WebUI to display raw LLM output without backend parsing or frontend Markdown rendering

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessage.svelte

Co-authored-by: Aleksander Grygier <[email protected]>

* webui: add Storybook example for raw LLM output and scope reasoning format toggle per story

- Added a Storybook example that showcases the chat message component in raw LLM output mode with the provided trace sample
- Updated every ChatMessage story to toggle the disableReasoningFormat setting so the raw-output rendering remains scoped to its own example

* npm run format

* chat-parser: address review feedback from ngxson

Co-authored-by: Xuan Son Nguyen <[email protected]>

---------

Co-authored-by: Aleksander Grygier <[email protected]>
Co-authored-by: Xuan Son Nguyen <[email protected]>

Assets 15

03 Oct 07:08

github-actions

b6674

5113efd

b6674

fix: track viewportHeight via window.innerHeight to avoid unwanted sc…

Assets 15

02 Oct 06:41

github-actions

b6665

95ce098

b6665

HIP: add IMbackK to codeowner (#16375)

Assets 15

29 Aug 02:23

github-actions

b6315

e8d99dd

b6315

nvidia nemotron nano v2 (nemotronh) (#15507)

* feat: Add NEMOTRONH to python arch enum

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Add NEMOTRONH to c++ arch enum

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Add NEMOTRONH to llama-arch layer map

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: First pass at conversion for nemotronh

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Add a verbose log for each tensor loaded

This is really helpful for diagnosing mismatches between the expected and
received tensors

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: First (broken) pass at nemotronh model architecture

It generates tokens, just not valid ones!

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Explicitly enable add_bos_token during conversion

The `tokenizer.json`/`tokenizer_config.json` in the model are a bit
contradictory. In the config, add_bos_token is set to False, but the
tokenizer model itself has a post_processor that adds the BOS token via
type: TemplateProcessing

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Use relu2 (LLM_FFN_RELU_SQR) for activation in FFN layers

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Only allocate attention cache for attention layers (not non-recurrent)

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Move residual add to after every block

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Use the correct norm tensor for the MLP blocks

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <[email protected]>

* Nemotron-H: MLP gate cleanup (pass NULL for unused gate)

This model does not use a gate in MLP blocks; pass NULLs for gate tensors to make intent clear and avoid unused-pointer noise.

* SSM: respect ssm_dt_rank for dt_dim when provided

Use GGUF-provided time_step_rank (ssm_dt_rank) to set dt_dim when > 0; fallback to max(64, n_embd/16).

* fix: plamo2 - revert dt_dim to default (remove ssm_dt_rank usage)

* Rename nemotronh to nemotron_h for consistency

- Update architecture name from NEMOTRONH to NEMOTRON_H in constants.py
- Change architecture string from 'nemotronh' to 'nemotron_h' in all files
- Update enum LLM_ARCH_NEMOTRONH to LLM_ARCH_NEMOTRON_H
- Update class name llm_build_nemotronh to llm_build_nemotron_h
- Consistent naming with underscore convention (nemotron_h vs nemotronh)

* feat: Support conversion for older NemotronH models

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <[email protected]>

---------

Signed-off-by: Gabe Goodhart <[email protected]>
Co-authored-by: Maicon Domingues <[email protected]>
Co-authored-by: weatherman <[email protected]>

Assets 15

11 Aug 08:37

github-actions

b6123

79c1160

b6123

cuda: refactored ssm_scan and use CUB (#13291)

* cuda: refactored ssm_scan to use CUB

* fixed compilation error when when not using CUB

* assign L to constant and use size_t instead of int

* deduplicated functions

* change min blocks per mp to 1

* Use cub load and store warp transpose

* suppress clang warning

Assets 15

07 Aug 04:41

github-actions

b6107

36d3f00

b6107

requirements : fix PyTorch uint64 compatibility (#15134)

This commit addresses an issue with the convert_hf_to_gguf script
which is currently failing with:
```console
AttributeError: module 'torch' has no attribute 'uint64'
```

This occurred because safetensors expects torch.uint64 to be available
in the public API, but PyTorch 2.2.x only provides limited support for
unsigned types beyond uint8 it seems. The torch.uint64 dtype exists but
is not exposed in the standard torch namespace
(see pytorch/pytorch#58734).

PyTorch 2.4.0 properly exposes torch.uint64 in the public API, resolving
the compatibility issue with safetensors. This also required torchvision
to updated to =0.19.0 for compatibility.

Refs: https://huggingface.co/spaces/ggml-org/gguf-my-repo/discussions/186#68938de803e47d990aa087fb
Refs: https://github.com/pytorch/pytorch/issues/58734

Assets 15

06 Aug 03:14

github-actions

b6097

9515c61

b6097

ggml: WebGPU disable SET_ROWS for now (#15078)

* Add paramater buffer pool, batching of submissions, refactor command building/submission

* Add header for linux builds

* Free staged parameter buffers at once

* Format with clang-format

* Fix thread-safe implementation

* Use device implicit synchronization

* Update workflow to use custom release

* Remove testing branch workflow

* Disable set_rows until it's implemented

* Fix potential issue around empty queue submission

* Try synchronous submission

* Try waiting on all futures explicitly

* Add debug

* Add more debug messages

* Work on getting ssh access for debugging

* Debug on failure

* Disable other tests

* Remove extra if

* Try more locking

* maybe passes?

* test

* Some cleanups

* Restore build file

* Remove extra testing branch ci

Assets 15

31 Jul 07:30

github-actions

b6040

66625a5

b6040

graph : reduce splits for recurrent and hybrid models (#14825)

* graph : avoid creating redundant s_copy views

* graph : comment the s_copy views

Assets 15

Releases: YannFollet/llama.cpp

b6756

Uh oh!

b6736

Uh oh!

b6715

Uh oh!

b6674

Uh oh!

b6665

Uh oh!

b6315

Uh oh!

b6123

Uh oh!

b6107

Uh oh!

b6097

Uh oh!

b6040

Uh oh!