
Conversation

@DajanaV (Contributor) commented Oct 31, 2025

Mirrored from ggml-org/llama.cpp#16757

Extended MarkdownContent to flag previewable code languages, add a preview button alongside the copy controls, manage preview dialog state, and share styling for the new button group.

Introduced CodePreviewDialog.svelte, a sandboxed iframe modal for rendering HTML/JS previews with consistent dialog controls.

sandboxed-html-preview-AVC500kbps.mp4

allozaur and others added 30 commits October 1, 2025 18:18
…onditional rendering for Actions Dropdown for Chat Conversation Items (#16369)

* fix: Render Conversation action dialogs as singletons from Chat Sidebar level

* chore: update webui build output

* fix: Render Actions Dropdown conditionally only when user hovers conversation item + remove unused markup

* chore: Update webui static build

* fix: Always truncate conversation names

* chore: Update webui static build
* common: introduce http.h for httplib-based client

This change moves cpp-httplib based URL parsing and client setup into
a new header `common/http.h`, and integrates it in `arg.cpp` and `run.cpp`.

It is an iteration towards removing libcurl, while intentionally
minimizing changes to existing code to guarantee the same behavior when
`LLAMA_CURL` is used.
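The URL splitting such a header needs before handing host/path to a cpp-httplib client can be sketched roughly as follows; `url_parts` and `parse_url` are hypothetical names for illustration, not the actual `common/http.h` API:

```cpp
#include <string>

// Hypothetical stand-in for the kind of URL splitting a httplib-based
// client needs: scheme, "host[:port]" (what httplib::Client takes), path.
struct url_parts {
    std::string scheme;    // "http" or "https"
    std::string host_port; // e.g. "example.com:8443"
    std::string path;      // request path, defaults to "/"
};

inline url_parts parse_url(const std::string & url) {
    url_parts out;
    const auto scheme_end = url.find("://");
    out.scheme = (scheme_end == std::string::npos) ? "http" : url.substr(0, scheme_end);
    const std::string rest = (scheme_end == std::string::npos) ? url : url.substr(scheme_end + 3);
    const auto slash = rest.find('/');
    out.host_port = rest.substr(0, slash);
    out.path = (slash == std::string::npos) ? "/" : rest.substr(slash);
    return out;
}
```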

Signed-off-by: Adrien Gallouët <[email protected]>

* tools : add missing WIN32_LEAN_AND_MEAN

Signed-off-by: Adrien Gallouët <[email protected]>

---------

Signed-off-by: Adrien Gallouët <[email protected]>
Signed-off-by: Adrien Gallouët <[email protected]>
* CI: Properly install rocwmma for hip builds

on Windows we now install rocwmma from Ubuntu packages

* CI: update linux rocm docker build to use rocm 7.0
…16075)

* Fix to use hidden_size_per_head

* Fix num heads

* Fix array

* Fix loading weights

* Support old GGUF converted by the previous version of llama.cpp

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Move shared parameter definitions to the outside of loop

* Do not calculate n_embd_head_k/v as n_embd / n_head

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
…0 (#16221)

* HIP: Disable ROCWMMA fatt on CDNA when compiled against ROCWMMA 2.0.0

rocwmma 2.0.0 includes a bug in the code faking fp16 accumulation on CDNA

* CUDA: Fix volta condition in ggml_cuda_should_use_wmma_fattn
* update oneapi to 2025.2, use deep-learning-essentials to replace base-tool

* update to 2025.2, use deep-learning-essentials to replace the base toolkit

* add missed dll

* add deep learning essentials

* add sycl-ls

---------

Co-authored-by: Zhang Jianyu <[email protected]>
* First attempt

* No permute during convert (fixes qk tensors), proper norm application.

* RoPE = NeoX

* Coherence!

* Migrate xielu params from tensors to hyperparameters

* Simple CUDA kernel

* Revert stupid LLM refactorings

* Chat template support

* configchecker / flake8 errors

* Reorder unary.cu

* I do conclude that LLMs are, in fact, stupid.

* Fix after merge

* Final newline

* Make xIELU an UNARY_OP

* Final newline

* Correctly account for parameter shift

* Argh.

* Update ggml/src/ggml-cpu/unary-ops.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* Refactor: remove unused methods, inline and factorize softplus, add const modifiers

* Revert CUDA changes, implement xIELU as a separate OP

* Pesky newline

* Add float2half / half2float for F16 inputs/outputs

* CUDA variants, attempt 2

* Actually, attempt 3

* Update ggml/src/ggml-cuda/unary.cu

Co-authored-by: Johannes Gäßler <[email protected]>

* Missing convert header

* Proper formula and reference for xIELU in the comments.

* Modify unary-ops.cpp to add the functor-based logic besides the template system to retain optimizations

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Add tensor mappings for Apertus to global list instead

* Fix lazy on scalars

* Update ggml/src/ggml-cuda/unary.cu

Co-authored-by: Johannes Gäßler <[email protected]>

* Add comment about the constraints on positive/negative alpha

* Change `softplus` to `ggml_softplus`

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
* Add inplace softmax

* Move rms_norm to split row approach

* Update debug for supports_op

* clean up debug statements

* Update tests/test-backend-ops.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
…389)

* do not use more threads than physically available

* ensure n_threads > 0

Co-authored-by: Jeff Bolz <[email protected]>

---------

Co-authored-by: Jeff Bolz <[email protected]>
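The two fixes above amount to clamping the requested thread count; a minimal sketch, where the helper name is hypothetical:

```cpp
#include <algorithm>
#include <thread>

// Hypothetical helper illustrating both fixes: never use more threads than
// are physically available, and never return a non-positive count.
inline int clamp_n_threads(int requested) {
    // hardware_concurrency() may return 0 when it cannot be determined;
    // fall back to 1 in that case.
    int hw = static_cast<int>(std::thread::hardware_concurrency());
    if (hw <= 0) hw = 1;
    return std::max(1, std::min(requested, hw));
}
```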
…rolling (#16356)

Use <svelte:window bind:innerHeight> instead of manual resize listener

Co-authored-by: Aleksander Grygier <[email protected]>
* fix: Include just the currently active message branches instead of all in chat completions request

* chore: Build webui static output

* chore: Formatting

* chore: update webui build output
…quest (#16405)

* feat: Capture model name only after first token (streaming) or completed request (non-streaming)

* chore: update webui build output

* chore: update webui build output
This commit updates the macos-13 runners to macos-15-intel.

The motivation for this change is that the macos-13 runners are scheduled
to be retired on 2025-12-04.

Refs: https://github.blog/changelog/2025-09-19-github-actions-macos-13-runner-image-is-closing-down/
When computing sinks, the cm1 shader was looping r from 0 to Br rather than
to rows_per_thread. I must have copied this from the scalar path (where it is
correct), and somehow it wasn't causing failures on current drivers.
…6354)

* vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE

Replace maxMemoryAllocationSize check with maxBufferSize when creating buffers.
The maxMemoryAllocationSize limit is a "soft" limit and allocations can succeed
beyond that limit. This allows > 4GB buffers to be allocated on some
implementations (e.g. NVIDIA) and tensors this large can be used for im2col
and mul_mat.

For temporary buffers (prealloc_x/y/etc) check against maxStorageBufferRange.
I'm not sure this check is ideal, but we always use these buffers as a single
full size binding and the limit may be smaller than maxMemoryAllocationSize
or maxBufferSize, so I think this is reasonable.

Replace descriptor range uses of VK_WHOLE_SIZE with a manually computed range.
The maxStorageBufferRange may be smaller than the maxBufferSize or
maxMemoryAllocationSize (and the Vulkan spec warns about this in a note) and
it's invalid usage if VK_WHOLE_SIZE computes a range larger than
maxStorageBufferRange.

With this change, it should be possible to generate videos using wan networks
in stable-diffusion.cpp.

* vulkan: Add env var GGML_VK_FORCE_MAX_BUFFER_SIZE and use stoull
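The three limits the commit distinguishes can be summarized in a small sketch; the struct and function names here are hypothetical, not the actual ggml-vulkan code:

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative sketch of the limit selection described above.
struct vk_limits {
    uint64_t max_buffer_size;          // hard cap for buffer creation
    uint64_t max_storage_buffer_range; // cap for a single descriptor binding
};

// Device buffers: check against maxBufferSize (maxMemoryAllocationSize is a
// soft limit, and larger allocations can still succeed, e.g. on NVIDIA).
inline bool can_create_buffer(const vk_limits & l, uint64_t size) {
    return size <= l.max_buffer_size;
}

// Temporary buffers bound as one full-size binding: the binding range must
// also fit maxStorageBufferRange, which may be smaller than maxBufferSize,
// so VK_WHOLE_SIZE cannot be used blindly.
inline bool can_bind_whole(const vk_limits & l, uint64_t size) {
    return size <= std::min(l.max_buffer_size, l.max_storage_buffer_range);
}
```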
* fix: resolve message disappearing issue when navigating between regenerated siblings by using current leaf nodes instead of cached sibling IDs

* chore: update webui build output

* chore: update webui build output
Reallocation is needed if a single chunk grows in size,
even if the total allocation size stays the same or is lower.
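A minimal sketch of that rule, assuming chunk sizes are tracked per chunk (the helper name is hypothetical):

```cpp
#include <cstddef>
#include <vector>

// Compare chunk sizes element-wise, not just the summed total: one chunk
// growing forces a reallocation even when the overall total shrinks.
inline bool needs_realloc(const std::vector<size_t> & cur,
                          const std::vector<size_t> & req) {
    if (req.size() > cur.size()) return true;
    for (size_t i = 0; i < req.size(); ++i) {
        if (req[i] > cur[i]) return true; // a single chunk grew
    }
    return false;
}
```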
* initial commit for branch 3

* generalize `swa_checkpoint` to `ctx_checkpoint`

this extends `llama-server`'s SWA checkpointing logic to include
hybrid/recurrent models such as Jamba, Granite

* oops

* disable debug prints

* keep backwards compat with `--swa-checkpoints`

Co-authored-by: Georgi Gerganov <[email protected]>

* update prompt re-processing message

* fix off-by-one error per GG

* keep `seq_rm` log per GG

Co-authored-by: Georgi Gerganov <[email protected]>

* server : fix checkpoint logic to support recurrent caches

* server : cleanup and fixes

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* feat: added a dedicated Magistral chat format that preserves [THINK] spans, parses reasoning before tool calls

* feat: new flow in the chat template test suite for Magistral
* vulkan (DRAFT): split shader generation by GLSL source file, to improve incremental build times

* support dep-files so shaders are recompiled if their included files change

* rename shader files which are used as "headers" to use .glsl extension
* move glslc extension detection shaders to separate folders
* the above is to prevent them from getting glob'd with the actual compute shaders that need to be compiled

* vulkan : only write embedded shader .hpp/.cpp when they change

* avoid recompiling ggml-vulkan.cpp when editing shaders
* pass single --source argument instead of --input-dir & --filter to shader gen
* check for source file match earlier

* fix hang in vulkan-shaders-gen when there are compilation errors

* early out did not decrement compile_count

* clean up

* fix glslc integer dot product test

* unconditionally write the embedded shader cpp output

* replace output filepath in generated dep-files to match output in CMakeLists

---------

Co-authored-by: Jeff Bolz <[email protected]>
* rpc : add support for multiple devices

Allow rpc-server to expose multiple devices from a single endpoint.
Change RPC protocol to include device identifier where needed.

closes: #15210

* fixes

* use ggml_backend_reg_t

* address review comments

* fix llama-bench backend report

* address review comments, change device naming

* fix cmd order
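The protocol change can be pictured as tagging each message with a device index, so a single endpoint can route commands to any of its exposed devices. This is an illustrative sketch only, not the real ggml-rpc wire format:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical message header: each RPC request now carries the identifier
// of the device it targets on the remote endpoint.
struct rpc_msg_header {
    uint8_t  cmd;    // which RPC command this is
    uint32_t device; // NEW: target device on the endpoint
};

inline std::vector<uint8_t> serialize(const rpc_msg_header & h) {
    std::vector<uint8_t> buf(sizeof(h));
    std::memcpy(buf.data(), &h, sizeof(h));
    return buf;
}

inline rpc_msg_header deserialize(const std::vector<uint8_t> & buf) {
    rpc_msg_header h{};
    std::memcpy(&h, buf.data(), sizeof(h));
    return h;
}
```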
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Based on the performance analysis conducted for project 2621b8c0-b5ce-11f0-b333-453f42058aa1 comparing version cc0e41de-4e40-43e3-8ff3-249ad81c686c against base version 594f77ca-22bd-452c-8517-681d48c6e069, here are the key findings:

Performance Summary

Critical Function Changes

The analysis identified minimal performance degradations in non-critical functions:

  • ~_Scanner destructor: +0.057% response time (+15 ns), +0.080% throughput (+15 ns)
  • Quantization lambda __invoke: +0.116% bottleneck (+18 ns)

Power Consumption Analysis

  • Total change: Negligible across all binaries (0.0% to +0.0003%)
  • Affected binary: build.bin.libllama.so shows minimal increase (+0.0003%)
  • Other binaries: No measurable changes in GGML libraries

KPI Impact Assessment

1. Tokens Per Second

Status: No impact detected

  • Critical functions analyzed: No changes found in core inference functions
    • llama_decode(): Not identified in degradation analysis
    • llama_encode(): Not identified in degradation analysis
    • llama_tokenize(): Not identified in degradation analysis
  • Reference impact: Given that 2ms slower llama_decode() reduces tokens/second by 7% on the reference system (ollama://smollm:135m, 12th Gen Intel i7-1255U, Ubuntu 24.04.3), the observed nanosecond-level changes in non-inference functions will not affect tokenization throughput
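As a sanity check on that reference figure: with a baseline per-token decode time t (in ms), adding 2 ms must satisfy t / (t + 2) = 0.93 for a 7% throughput drop, which gives t = 0.93 × 2 / 0.07 ≈ 26.6 ms. A one-line helper makes the arithmetic explicit:

```cpp
#include <cmath>

// Ratio of degraded to baseline tokens/second when each llama_decode()
// call gets slower by slowdown_ms, given baseline per-token time t_ms.
inline double throughput_ratio(double t_ms, double slowdown_ms) {
    return t_ms / (t_ms + slowdown_ms);
}
```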

2. Power Consumption

Status: Minimal impact

  • Affected binary: build.bin.libllama.so (+0.0003% increase)
  • Unchanged binaries:
    • build.bin.libggml-base.so (0.0% change)
    • build.bin.libggml-cpu.so (0.0% change)
    • build.bin.libggml.so (0.0% change)
  • Root cause: Micro-degradations in destructor and quantization helper functions

3. Quantization Efficiency

Status: Minimal impact

  • Affected function: Quantization lambda __invoke in llama-quant.cpp
  • Change: +18 ns bottleneck time (+0.116%)
  • Impact scope: Internal quantization processing, not affecting llama_model_quantize() API performance

4. Memory Usage

Status: No impact detected

  • Memory management functions: No changes identified in:
    • llama_memory_clear()
    • llama_memory_seq_rm()
    • llama_memory_seq_cp()
    • KV cache operations
  • Assessment: Memory allocation patterns remain unchanged

5. Batch Processing

Status: No impact detected

  • Batch processing functions: No changes identified in:
    • llama_batch_init()
    • llama_batch_get_one()
    • llama_batch_free()
    • llama_decode() with batches
  • Assessment: Parallel token processing efficiency maintained

Root Cause Analysis

Assembly Code Analysis

The control flow graph comparison revealed identical assembly code between versions for the degraded ~_Scanner function, indicating the performance changes stem from:

  • Build environment differences: Compiler version or optimization flag variations
  • System library changes: Updated C++ standard library implementations
  • Memory layout variations: Different ASLR or linking configurations affecting cache behavior

Code Change Correlation

The concurrent PR #23 contains only frontend Svelte/HTML modifications with no C++ code changes, confirming the performance degradations are build-environment related rather than code-driven.

Action Items

Build System Optimization

  • Standardize build environment: Ensure consistent compiler versions and optimization flags across builds
  • Link-time optimization: Consider enabling LTO to reduce PLT call overhead in destructor functions
  • Static linking evaluation: Assess static linking benefits for reducing dynamic library call overhead

Performance Monitoring

  • Baseline establishment: Implement consistent measurement protocols using hardware performance counters
  • Build reproducibility: Establish controlled build environments to eliminate measurement noise
  • Function-level tracking: Monitor core inference functions (llama_decode, llama_encode, llama_tokenize) for actual performance impacts

Code Quality Maintenance

  • Template optimization: Review C++ standard library template instantiations for potential compiler-specific optimizations
  • Memory layout analysis: Evaluate object member ordering for cache locality improvements in frequently destructed objects

The analysis confirms that core LLaMA.cpp inference performance remains unaffected, with observed degradations limited to auxiliary functions and likely attributable to build environment variations rather than functional regressions.
