Skip to content

Commit 82c0804

Browse files
author
Daniel Bank
authored
Adding 20250709 meetup notes (#56)
1 parent 0f6e100 commit 82c0804

2 files changed

Lines changed: 221 additions & 0 deletions

File tree

content/2025-07-09.md

Lines changed: 221 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,221 @@
1+
+++
2+
title = "Rust <> AI"
3+
date = 2025-07-09
4+
draft = false
5+
in_search_index = true
6+
template = "page.html"
7+
8+
[taxonomies]
9+
tags = ["Rust", "AI", "LLMs", "Hugging Face"]
10+
categories = ["meetups"]
11+
+++
12+
13+
![Llama Crab Emoji](https://github.com/azdevs/desert-rustaceans/raw/master/static/emojis/rust_llama.png)
14+
15+
Topics:
16+
17+
- Candle a Minimalist ML Framework via [@MasonStallmo](https://github.com/mstallmo)
18+
- 🦙🦀 Tauri-Served Local LLMs with Mistral.rs via [@DanielPBank](https://github.com/danielbank)
19+
20+
<!-- more -->
21+
22+
# Candle a Minimalist ML Framework via [@MasonStallmo](https://github.com/mstallmo)
23+
24+
Slides: [local_first_security.pdf](https://github.com/azdevs/desert-rustaceans/raw/master/static/candle-2025-07-09.pdf)
25+
26+
Candle is a lightweight, fast ML framework written in Rust that aims to provide a familiar PyTorch-like experience while addressing the performance and deployment limitations of Python-based frameworks. Created by Hugging Face, Candle is designed to be particularly well-suited for serverless ML deployments and browser-based inference.
27+
28+
## Key Features and Performance
29+
30+
Candle significantly outperforms traditional Python frameworks in both memory usage and speed. According to benchmark data, Candle uses only 3.2GB peak RAM compared to torch-rs's 4.7GB, with substantially lower memory growth (18 MB/min vs 42 MB/min). Performance benchmarks show Candle executing BERT Base in 8.3ms compared to torch-rs's 15.7ms, and LLaMA 2 7B inference at 45.2ms/token versus 72.8ms/token.
31+
32+
```rust
33+
// Example: Basic tensor operations in Candle
34+
use candle_core::{Device, Tensor};
35+
36+
let device = Device::Cpu;
37+
let a = Tensor::new(&[[1f32, 2.], [3., 4.]], &device)?;
38+
let b = Tensor::new(&[[5f32, 6.], [7., 8.]], &device)?;
39+
let result = a.matmul(&b)?;
40+
```
41+
42+
## Unique Advantages
43+
44+
The framework's most distinctive feature is its ability to compile models to WebAssembly (WASM), enabling ML inference directly in browsers without server dependencies—something impossible with PyTorch. This makes Candle particularly valuable for privacy-conscious applications and edge computing scenarios. The framework provides all essential ML components including model structure, weight serialization, training capabilities with optimizers and data loaders, backpropagation, and inference engines.
45+
46+
```rust
47+
// Example: MNIST training loop structure
48+
for epoch in 1..=epochs {
49+
let mut sum_loss = 0f32;
50+
for (bimages, blabels) in train_iter {
51+
let logits = model.forward(&bimages)?;
52+
let loss = loss::cross_entropy(&logits, &blabels)?;
53+
optimizer.backward_step(&loss)?;
54+
sum_loss += loss.to_vec0::<f32>()?;
55+
}
56+
}
57+
```
58+
59+
While Candle trades some of PyTorch's extensive feature set for simplicity and performance, it maintains familiar APIs that make it approachable for PyTorch users. The framework supports popular models and can be explored through various [Hugging Face demos](https://huggingface.co/spaces/lmz/candle-yolo) showcasing YOLO, Whisper, LLaMA 2, and other models running entirely in the browser.
60+
61+
# 🦙🦀 Tauri-Served Local LLMs with Mistral.rs via [@DanielPBank](https://github.com/danielbank)
62+
63+
Repo: [https://github.com/danielbank/tauri-mistral-chat](https://github.com/danielbank/tauri-mistral-chat)
64+
65+
Daniel built a simple desktop chatbot demo with [Tauri](https://v2.tauri.app/) (a cross-platform framework), React (frontend JS framework), and [mistral.rs](https://github.com/EricLBuehler/mistral.rs) (a cross-platform, highly-multimodal inference engine written in Rust). The demo integrates Mistral AI models for local inference.
66+
67+
## Hidden Gems in the Mistral.rs Documentation
68+
69+
The [mistral.rs Docs](https://ericlbuehler.github.io/mistral.rs/mistralrs/) can be a little hard to navigate. Here are a few things Daniel found REALLY HELPFUL:
70+
71+
### Rust Examples!
72+
73+
The [Rust examples](https://github.com/EricLBuehler/mistral.rs/tree/master/mistralrs/examples) are all here and are a good starting point for simple programs that demonstrate the models
74+
75+
### ❗ Chat Templates (IMPORTANT)
76+
77+
You will need to specify [a chat template](https://github.com/EricLBuehler/mistral.rs/tree/master/chat_templates) ([e.g. `mistral.json`](https://github.com/danielbank/tauri-mistral-chat/blob/main/src-tauri/examples/hello_world.rs#L187-L192)) with your model builder:
78+
79+
```rs
80+
builder = builder.with_chat_template(template_path);
81+
```
82+
83+
These templates are readily available in the mistral.rs repo, but you have to look for them: https://github.com/EricLBuehler/mistral.rs/tree/master/chat_templates
84+
85+
As an aside, you can use remote tokenizer as backup: `.with_tok_model_id("mistralai/Mistral-7B-Instruct-v0.1")`
86+
87+
## Model Management
88+
89+
### Downloading Models
90+
91+
The demo features two examples in `./src-tauri` which are meant to demonstrate Mistral.rs inference with a Local LLM without the added complexity of Tauri. The first example downloads relevant models and the second example runs the inference:
92+
93+
```bash
94+
cd src-tauri
95+
cargo run --example download-models list
96+
cargo run --example download_models download llama-vision --force --yes
97+
cargo run --example hello_world
98+
```
99+
100+
### Instantiating a Model
101+
102+
Each model type has a different builder. For example, the `TextModelBuilder` is used for text-based models:
103+
104+
```rs
105+
async fn load_remote_smollm3_model() -> Result<mistralrs::Model, String> {
106+
println!("Loading remote SmolLM3 3B model...");
107+
108+
// Build the remote SmolLM3 model using TextModelBuilder
109+
let model = TextModelBuilder::new("HuggingFaceTB/SmolLM3-3B")
110+
.with_isq(IsqType::Q8_0)
111+
.with_logging()
112+
.build()
113+
.await
114+
.map_err(|e: anyhow::Error| format!("Failed to build remote SmolLM3 model: {}", e))?;
115+
116+
println!("Remote SmolLM3 model loaded successfully!");
117+
Ok(model)
118+
}
119+
```
120+
121+
Different formats of the same model type also have different builders. For example, to load a text-model using the UQFF format, you would use the `UqffTextModelBuilder`:
122+
123+
```rs
124+
async fn load_remote_llama_uqff_model() -> Result<mistralrs::Model, String> {
125+
println!("Loading remote Llama 3B UQFF model...");
126+
127+
let model = UqffTextModelBuilder::new(&model_path, uqff_files)
128+
.into_inner()
129+
.with_isq(IsqType::Q5_0)
130+
.with_logging()
131+
.build()
132+
.await
133+
.map_err(|e: anyhow::Error| format!("Failed to build Llama UQFF text model: {}", e))?;
134+
135+
println!("Llama UQFF text model loaded successfully!");
136+
return Ok(model);
137+
```
138+
139+
### Inference
140+
141+
Once you have a model, you can run inference with the `Model` struct. For example, to run inference with a text-model, you would use the `Model::chat` method:
142+
143+
```rs
144+
async fn run_inference(model: mistralrs::Model, message: String) -> Result<String, String> {
145+
let messages = TextMessages::new()
146+
.add_message(
147+
TextMessageRole::User,
148+
&format!("You are a helpful AI assistant. Keep your responses concise and friendly.\n\n{}", message)
149+
);
150+
151+
model
152+
.send_chat_request(messages)
153+
.await
154+
.map_err(|e| format!("Failed to send text chat request: {}", e))?
155+
}
156+
```
157+
158+
## Integration with the Frontend
159+
160+
The Tauri application exposes commands to the frontend JavaScript to run the inference: [discover_models](https://github.com/danielbank/tauri-mistral-chat/blob/main/src-tauri/src/lib.rs#L59-L174) and [ai_chat](https://github.com/danielbank/tauri-mistral-chat/blob/main/src-tauri/src/lib.rs#L304-L384). The frontend is a simple React app that uses the Tauri API to call the Rust functions:
161+
162+
```javascript
163+
// Call Tauri backend
164+
console.log("Calling Tauri backend with:");
165+
console.log("- message:", message.content);
166+
console.log("- modelId:", modelId);
167+
console.log("- hasImage:", !!imageData);
168+
console.log("- imageDataLength:", imageData?.length || 0);
169+
170+
const response =
171+
(await invoke) <
172+
string >
173+
("ai_chat",
174+
{
175+
message: message.content,
176+
modelId: modelId,
177+
imageData: imageData,
178+
});
179+
180+
console.log("Received response from Tauri backend:", response);
181+
```
182+
183+
## Universal Quantized File Format (UQFF) and GGML Universal File (GGUF) Format
184+
185+
### What is UQFF?
186+
187+
Think of UQFF as a new way to package AI models so they run faster and use less computer memory. It's like having a ZIP file specifically designed for AI models. Specifically, it uses a technique called "quantization" to compress AI models to make them smaller and faster - kind of like how you might compress a video file to make it smaller.
188+
189+
#### What Makes UQFF Special
190+
191+
- **One File, Multiple Options** - Instead of having separate files for different compression levels, UQFF lets you pack multiple compression types into one file. It's like having a ZIP file that contains both the HD version and the compressed version of a movie.
192+
193+
- **No More Waiting** - Previously, if you wanted to use a compressed AI model, you had to wait for your computer to compress it first (which could take a while). With UQFF, someone already did the compression work for you - you just download and use it.
194+
195+
- **Works with Many Types** - It supports different compression methods (they have nerdy names like Q4_0, Q8_1, etc.) but basically just think of them as different quality/speed settings.
196+
197+
### What is GGUF?
198+
199+
GGUF stands for "GGML Universal File" (or sometimes "Generic GPT Unified Format") - it's a way to store AI models that makes them run faster and use less memory on regular computers like yours. It's essentially a special compression method that squishes models down so they can run on your laptop or desktop computer instead of needing a supercomputer.
200+
201+
#### What GGUF Does
202+
203+
- Compresses big AI models so they can run on CPUs or low-power devices
204+
- Enables running complex models on everyday hardware like CPUs
205+
- Optimized for quick loading and saving of models, making it highly efficient for inference purposes
206+
207+
#### Advantages
208+
209+
- One file format, one compression method
210+
- Very popular and widely supported
211+
- Works great, but limited to just GGUF-style compression
212+
213+
# Crates you should know
214+
215+
- [https://crates.io/crates/candle-core](https://crates.io/crates/candle-core): Minimalist ML framework
216+
- [https://crates.io/crates/tch](https://crates.io/crates/tch): Rust wrappers for the PyTorch C++ api (libtorch)
217+
- [https://github.com/EricLBuehler/mistral.rs](https://github.com/EricLBuehler/mistral.rs): It's not a crate but you can still add it as a dependency using the GitHub URL for the repo
218+
219+
```
220+
mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git" }
221+
```

static/candle-2025-07-09.pdf

630 KB
Binary file not shown.

0 commit comments

Comments
 (0)