azdevs
diff --git a/‎content/2025-07-09.md‎
Lines changed: 221 additions & 0 deletions b/‎content/2025-07-09.md‎
Lines changed: 221 additions & 0 deletions
diff --git a/‎static/candle-2025-07-09.pdf‎
630 KB b/‎static/candle-2025-07-09.pdf‎
630 KB
@@ -0,0 +1,221 @@
++++
+title = "Rust <> AI"
+date = 2025-07-09
+draft = false
+in_search_index = true
+template = "page.html"
+
+[taxonomies]
+tags = ["Rust", "AI", "LLMs", "Hugging Face"]
+categories = ["meetups"]
++++
+
+![Llama Crab Emoji](https://github.com/azdevs/desert-rustaceans/raw/master/static/emojis/rust_llama.png)
+
+Topics:
+
+- Candle a Minimalist ML Framework via [@MasonStallmo](https://github.com/mstallmo)
+- 🦙🦀 Tauri-Served Local LLMs with Mistral.rs via [@DanielPBank](https://github.com/danielbank)
+
+<!-- more -->
+
+# Candle a Minimalist ML Framework via [@MasonStallmo](https://github.com/mstallmo)
+
+Slides: [local_first_security.pdf](https://github.com/azdevs/desert-rustaceans/raw/master/static/candle-2025-07-09.pdf)
+
+Candle is a lightweight, fast ML framework written in Rust that aims to provide a familiar PyTorch-like experience while addressing the performance and deployment limitations of Python-based frameworks. Created by Hugging Face, Candle is designed to be particularly well-suited for serverless ML deployments and browser-based inference.
+
+## Key Features and Performance
+
+Candle significantly outperforms traditional Python frameworks in both memory usage and speed. According to benchmark data, Candle uses only 3.2GB peak RAM compared to torch-rs's 4.7GB, with substantially lower memory growth (18 MB/min vs 42 MB/min). Performance benchmarks show Candle executing BERT Base in 8.3ms compared to torch-rs's 15.7ms, and LLaMA 2 7B inference at 45.2ms/token versus 72.8ms/token.
+
+```rust
+// Example: Basic tensor operations in Candle
+use candle_core::{Device, Tensor};
+
+let device = Device::Cpu;
+let a = Tensor::new(&[[1f32, 2.], [3., 4.]], &device)?;
+let b = Tensor::new(&[[5f32, 6.], [7., 8.]], &device)?;
+let result = a.matmul(&b)?;
+```
+
+## Unique Advantages
+
+The framework's most distinctive feature is its ability to compile models to WebAssembly (WASM), enabling ML inference directly in browsers without server dependencies—something impossible with PyTorch. This makes Candle particularly valuable for privacy-conscious applications and edge computing scenarios. The framework provides all essential ML components including model structure, weight serialization, training capabilities with optimizers and data loaders, backpropagation, and inference engines.
+
+```rust
+// Example: MNIST training loop structure
+for epoch in 1..=epochs {
+    let mut sum_loss = 0f32;
+    for (bimages, blabels) in train_iter {
+        let logits = model.forward(&bimages)?;
+        let loss = loss::cross_entropy(&logits, &blabels)?;
+        optimizer.backward_step(&loss)?;
+        sum_loss += loss.to_vec0::<f32>()?;
+    }
+}
+```
+
+While Candle trades some of PyTorch's extensive feature set for simplicity and performance, it maintains familiar APIs that make it approachable for PyTorch users. The framework supports popular models and can be explored through various [Hugging Face demos](https://huggingface.co/spaces/lmz/candle-yolo) showcasing YOLO, Whisper, LLaMA 2, and other models running entirely in the browser.
+
+# 🦙🦀 Tauri-Served Local LLMs with Mistral.rs via [@DanielPBank](https://github.com/danielbank)
+
+Repo: [https://github.com/danielbank/tauri-mistral-chat](https://github.com/danielbank/tauri-mistral-chat)
+
+Daniel built a simple desktop chatbot demo with [Tauri](https://v2.tauri.app/) (a cross-platform framework), React (frontend JS framework), and [mistral.rs](https://github.com/EricLBuehler/mistral.rs) (a cross-platform, highly-multimodal inference engine written in Rust). The demo integrates Mistral AI models for local inference.
+
+## Hidden Gems in the Mistral.rs Documentation
+
+The [mistral.rs Docs](https://ericlbuehler.github.io/mistral.rs/mistralrs/) can be a little hard to navigate. Here are a few things Daniel found REALLY HELPFUL:
+
+### Rust Examples!
+
+The [Rust examples](https://github.com/EricLBuehler/mistral.rs/tree/master/mistralrs/examples) are all here and are a good starting point for simple programs that demonstrate the models
+
+### ❗ Chat Templates (IMPORTANT)
+
+You will need to specify [a chat template](https://github.com/EricLBuehler/mistral.rs/tree/master/chat_templates) ([e.g. `mistral.json`](https://github.com/danielbank/tauri-mistral-chat/blob/main/src-tauri/examples/hello_world.rs#L187-L192)) with your model builder:
+
+```rs
+   builder = builder.with_chat_template(template_path);
+```
+
+These templates are readily available in the mistral.rs repo, but you have to look for them: https://github.com/EricLBuehler/mistral.rs/tree/master/chat_templates
+
+As an aside, you can use remote tokenizer as backup: `.with_tok_model_id("mistralai/Mistral-7B-Instruct-v0.1")`
+
+## Model Management
+
+### Downloading Models
+
+The demo features two examples in `./src-tauri` which are meant to demonstrate Mistral.rs inference with a Local LLM without the added complexity of Tauri. The first example downloads relevant models and the second example runs the inference:
+
+```bash
+cd src-tauri
+cargo run --example download-models list
+cargo run --example download_models download llama-vision --force --yes
+cargo run --example hello_world
+```
+
+### Instantiating a Model
+
+Each model type has a different builder. For example, the `TextModelBuilder` is used for text-based models:
+
+```rs
+async fn load_remote_smollm3_model() -> Result<mistralrs::Model, String> {
+    println!("Loading remote SmolLM3 3B model...");
+
+    // Build the remote SmolLM3 model using TextModelBuilder
+    let model = TextModelBuilder::new("HuggingFaceTB/SmolLM3-3B")
+        .with_isq(IsqType::Q8_0)
+        .with_logging()
+        .build()
+        .await
+        .map_err(|e: anyhow::Error| format!("Failed to build remote SmolLM3 model: {}", e))?;
+
+    println!("Remote SmolLM3 model loaded successfully!");
+    Ok(model)
+}
+```
+
+Different formats of the same model type also have different builders. For example, to load a text-model using the UQFF format, you would use the `UqffTextModelBuilder`:
+
+```rs
+async fn load_remote_llama_uqff_model() -> Result<mistralrs::Model, String> {
+    println!("Loading remote Llama 3B UQFF model...");
+
+let model = UqffTextModelBuilder::new(&model_path, uqff_files)
+    .into_inner()
+    .with_isq(IsqType::Q5_0)
+    .with_logging()
+    .build()
+    .await
+    .map_err(|e: anyhow::Error| format!("Failed to build Llama UQFF text model: {}", e))?;
+
+println!("Llama UQFF text model loaded successfully!");
+return Ok(model);
+```
+
+### Inference
+
+Once you have a model, you can run inference with the `Model` struct. For example, to run inference with a text-model, you would use the `Model::chat` method:
+
+```rs
+async fn run_inference(model: mistralrs::Model, message: String) -> Result<String, String> {
+    let messages = TextMessages::new()
+        .add_message(
+            TextMessageRole::User,
+            &format!("You are a helpful AI assistant. Keep your responses concise and friendly.\n\n{}", message)
+        );
+
+    model
+        .send_chat_request(messages)
+        .await
+        .map_err(|e| format!("Failed to send text chat request: {}", e))?
+}
+```
+
+## Integration with the Frontend
+
+The Tauri application exposes commands to the frontend JavaScript to run the inference: [discover_models](https://github.com/danielbank/tauri-mistral-chat/blob/main/src-tauri/src/lib.rs#L59-L174) and [ai_chat](https://github.com/danielbank/tauri-mistral-chat/blob/main/src-tauri/src/lib.rs#L304-L384). The frontend is a simple React app that uses the Tauri API to call the Rust functions:
+
+```javascript
+// Call Tauri backend
+console.log("Calling Tauri backend with:");
+console.log("- message:", message.content);
+console.log("- modelId:", modelId);
+console.log("- hasImage:", !!imageData);
+console.log("- imageDataLength:", imageData?.length || 0);
+
+const response =
+  (await invoke) <
+  string >
+  ("ai_chat",
+  {
+    message: message.content,
+    modelId: modelId,
+    imageData: imageData,
+  });
+
+console.log("Received response from Tauri backend:", response);
+```
+
+## Universal Quantized File Format (UQFF) and GGML Universal File (GGUF) Format
+
+### What is UQFF?
+
+Think of UQFF as a new way to package AI models so they run faster and use less computer memory. It's like having a ZIP file specifically designed for AI models. Specifically, it uses a technique called "quantization" to compress AI models to make them smaller and faster - kind of like how you might compress a video file to make it smaller.
+
+#### What Makes UQFF Special
+
+- **One File, Multiple Options** - Instead of having separate files for different compression levels, UQFF lets you pack multiple compression types into one file. It's like having a ZIP file that contains both the HD version and the compressed version of a movie.
+
+- **No More Waiting** - Previously, if you wanted to use a compressed AI model, you had to wait for your computer to compress it first (which could take a while). With UQFF, someone already did the compression work for you - you just download and use it.
+
+- **Works with Many Types** - It supports different compression methods (they have nerdy names like Q4_0, Q8_1, etc.) but basically just think of them as different quality/speed settings.
+
+### What is GGUF?
+
+GGUF stands for "GGML Universal File" (or sometimes "Generic GPT Unified Format") - it's a way to store AI models that makes them run faster and use less memory on regular computers like yours. It's essentially a special compression method that squishes models down so they can run on your laptop or desktop computer instead of needing a supercomputer.
+
+#### What GGUF Does
+
+- Compresses big AI models so they can run on CPUs or low-power devices
+- Enables running complex models on everyday hardware like CPUs
+- Optimized for quick loading and saving of models, making it highly efficient for inference purposes
+
+#### Advantages
+
+- One file format, one compression method
+- Very popular and widely supported
+- Works great, but limited to just GGUF-style compression
+
+# Crates you should know
+
+- [https://crates.io/crates/candle-core](https://crates.io/crates/candle-core): Minimalist ML framework
+- [https://crates.io/crates/tch](https://crates.io/crates/tch): Rust wrappers for the PyTorch C++ api (libtorch)
+- [https://github.com/EricLBuehler/mistral.rs](https://github.com/EricLBuehler/mistral.rs): It's not a crate but you can still add it as a dependency using the GitHub URL for the repo
+
+```
+mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git" }
+```