The LLM engine is a local inference server built on top of llama.cpp (via the llama-cpp-4 Rust crate). It follows an actor pattern:
Frontend (Svelte) ⇄ Tauri IPC commands ⇄ Axum HTTP server ⇄ Actor thread (owns model)
(OpenAI-compatible API)
A single dedicated OS thread ("llm-actor") owns the LlamaBackend, LlamaModel, and LlamaContext. Axum HTTP handlers and Tauri commands communicate with it via an unbounded mpsc channel (InferRequest → actor → InferToken stream back).
| File | Role |
|---|---|
src-tauri/src/llm/mod.rs |
Thin re-export layer over skill-llm; Tauri AppHandle adapter |
src-tauri/src/llm/cmds.rs |
Tauri commands: start/stop server, download/delete models, chat history |
crates/skill-llm/src/engine.rs |
Actor thread, inference loop, tool orchestration, Axum routes, image decoding |
crates/skill-llm/src/catalog.rs |
Model catalog (bundled JSON + HF Hub cache discovery + download logic) |
crates/skill-llm/src/chat_store.rs |
SQLite persistence for chat sessions |
crates/skill-tools/src/parse.rs |
Tool call parsing, extraction, validation, prompt injection |
crates/skill-tools/src/defs.rs |
Built-in tool definitions (JSON Schema specs) |
crates/skill-tools/src/exec.rs |
Tool execution (each tool's runtime implementation) |
crates/skill-tools/src/context.rs |
Context-aware history trimming |
crates/skill-tools/src/types.rs |
LlmToolConfig, ToolExecutionMode |
src-tauri/llm_catalog.json |
Canonical model list — add new models here only, no Rust changes needed |
src-tauri/src/settings.rs |
LlmConfig struct (all config knobs) |
| Flag | Effect |
|---|---|
llm |
Core: model loading + inference |
llm-metal |
Metal GPU offload (macOS) |
llm-cuda |
CUDA GPU offload (NVIDIA) |
llm-vulkan |
Vulkan GPU offload (cross-platform, used on Linux/Windows) |
llm-mtmd |
Multimodal: vision/audio via libmtmd |
| Method | Path | Description |
|---|---|---|
GET |
/health |
Liveness + model ready state |
GET |
/v1/models |
List loaded model |
POST |
/v1/chat/completions |
Chat (streaming SSE + JSON) |
POST |
/v1/completions |
Raw text completion |
POST |
/v1/embeddings |
Dense embeddings (mean pool) |
Go to Settings → LLM tab. The model catalog is displayed with families (Qwen3.5 4B/9B/27B, Gemma3, Phi4, Ministral, etc.) and quant options (Q2_K through Q8_0/BF16). Click Download on any entry.
Tray menu → Downloads… shows active/completed downloads with progress, pause/resume/cancel.
download_llm_model(filename)Tauri command → spawns a blocking HF Hub download task- Downloads use
hf_hubcrate to fetch from HuggingFace repos (e.g.bartowski/Qwen_Qwen3.5-4B-GGUF) - Files are cached in the standard HF Hub cache dir (
~/.cache/huggingface/hub/) - Progress is tracked via shared
DownloadProgressArc + polled by the frontend every ~2s - Tray icon gets a progress ring overlay during downloads
- Supports pause/resume/cancel
After downloading, if no model is active, the first downloaded recommended model is auto-selected. The catalog persists to ~/.skill/llm_catalog.json.
If you download a GGUF file externally into the HF Hub cache, click Refresh in the LLM settings — refresh_llm_catalog re-probes the disk cache.
From src-tauri/llm_catalog.json:
- Qwen3.5 — 4B, 9B, 27B, 35B-A3B (MoE) + distilled/fine-tuned variants
- Qwen3 VL 30B — vision-language model
- Gemma3 270M — tiny model
- GPT-OSS 20B, OmniCoder 9B, Phi4 Reasoning Plus
- Ministral 14B (instruct + reasoning)
- LFM2.5 VL 1.6B — small vision-language model
- Qwen2.5.1 Coder 7B, Qwen3 Coder Next
To add a new model, only edit llm_catalog.json — no Rust code changes required.
Vision requires the llm-mtmd feature flag at compile time and a multimodal projector (mmproj) file.
In the catalog, mmproj files are marked with "is_mmproj": true and tagged ["vision", "multimodal"]. They're available for Qwen3.5, Qwen3 VL, Ministral, LFM2.5 VL families. Download one alongside the matching text model (same repo).
- Auto-load (default):
autoload_mmprojdefaults totrueinLlmConfig. When the server starts, it automatically resolves the best downloaded mmproj from the same repo as the active text model. - Manual: Set the active mmproj via Settings → LLM or
set_llm_active_mmproj(filename). The system validates repo compatibility — it rejects mmproj files from a different repo than the active model.
run_actor()loads the mmproj viaMtmdContext::init_from_file()after loading the main model- On Linux, mmproj GPU offload is disabled by default for stability (CPU projector); set
SKILL_FORCE_MMPROJ_GPU=1to override - Loading is wrapped in
catch_unwindto survive native crashes from incompatible files
- In the Chat window, images can be included as base64 data-URLs in message content (OpenAI-compatible multipart content format:
{"type": "image_url", "image_url": {"url": "data:image/png;base64,…"}}) extract_images_from_messages()decodes all base64 images before passing to the actor- The actor uses
MtmdContextto encode images into embeddings interleaved with text tokens - Status is reported:
get_llm_server_status()returnssupports_vision: truewhen mmproj is loaded
Settings → LLM tab: Toggle the Enable switch. Select a model. Click Start.
start_llm_server()— spawns background model load, returns immediately ("starting")stop_llm_server()— gracefully shuts down actor threadget_llm_server_status()— returnsStopped | Loading | Running, plusn_ctx,supports_vision,supports_tools,start_error
- Validates model file exists
- Resolves mmproj (auto or explicit)
- Spawns
run_actoron a dedicated thread with 8MB stack - Actor: init backend → load model → create context → warmup → load mmproj → set
readyflag - Emits
llm:statusevents for frontend progress tracking
When no local model is ready, the app now defaults to:
LFM2.5 1.2B Instruct(prefersQ4_K_M, thenQ4_0)
Approximate requirements for this default model:
- GGUF file size:
~0.73 GB(Q4_K_M) - Estimated runtime memory @ 4K context:
~1.2–1.3 GB(weights + KV cache + runtime overhead) - Practical minimum (Windows): at least 4 GB free RAM/VRAM
- Recommended for stable use: 8 GB+ free memory
Autolaunch safety guard:
- Before auto-launch, the backend computes a hardware-fit estimate.
- Auto-launch is blocked when fit is
too_tightor available memory is below required memory. - In that case, startup returns a descriptive
start_errorwith required vs available memory.
| Setting | Description | Default |
|---|---|---|
enabled |
Master switch | false |
n_gpu_layers |
Layers on GPU (0 = CPU, u32::MAX = all) |
0 |
ctx_size |
Context window in tokens | 4096 |
parallel |
Max concurrent inference requests | 1 |
api_key |
Optional Bearer auth for API | None |
autoload_mmproj |
Auto-load vision projector on start | true |
mmproj_n_threads |
Threads for vision encoder | 4 |
no_mmproj_gpu |
Force CPU for mmproj | false |
verbose |
Show raw llama.cpp logs | false |
The chat supports 9 built-in tools that the LLM can invoke:
| Tool | Description |
|---|---|
date |
Current date/time + timezone |
location |
IP-based geolocation |
web_search |
DuckDuckGo search |
web_fetch |
Fetch URL content |
bash |
Execute shell commands (with safety checks + approval dialogs for dangerous ops) |
read_file |
Read file contents (with offset/limit pagination) |
write_file |
Create/overwrite files |
edit_file |
Surgical find-and-replace edits |
search_output |
Regex search over bash output files |
Tools are individually toggleable via LlmToolConfig in Settings → LLM. Dangerous bash commands (rm, sudo, etc.) and writes to sensitive paths (/etc/, /usr/, etc.) trigger a user approval dialog.
- Model generates
<tool_call>XML blocks in its output tools.rsparses the blocks intoToolCallstructsexecute_builtin_tool_call()dispatches by tool name- Results are injected back as
"tool"role messages (mapped to"user"role with[Tool Result]wrapper for template compatibility) - Model continues generation with the tool results in context
- Context-aware trimming:
trim_messages_to_fit()drops oldest messages to stay within 75% ofn_ctx - Tool result truncation: Long tool outputs are capped at 2 KB in history
- Compact tool prompt: Smaller prompt for contexts ≤4096 tokens
- Think budget:
thinking_budgetlimits tokens in<think>…</think>blocks (default: 512)
Chat sessions are stored in SQLite (~/.skill/chats/chat.db):
chat_sessionstable — session metadatachat_messagestable — role + content per messagechat_tool_callstable — tool name, status, args, result per invocation
Managed by src-tauri/src/llm/chat_store.rs. Sessions are created/loaded via get_last_chat_session, create_chat_session, save_chat_message Tauri commands.