Mesh LLM lets you pool spare GPU capacity across machines and expose the result as one OpenAI-compatible API.
If a model fits on one machine, it runs there. If it does not, Mesh LLM automatically spreads the work across the mesh:
- Dense models use pipeline parallelism.
- MoE models use expert sharding with zero cross-node inference traffic.
- Every node gets the same local API at `http://localhost:9337/v1`.
- Run models larger than a single machine can hold.
- Turn a few uneven boxes into one shared inference pool.
- Give agents a local OpenAI-compatible endpoint instead of wiring each tool by hand.
- Keep the setup simple: start one node, add more later.
Install the latest release:

```bash
curl -fsSL https://raw.githubusercontent.com/michaelneale/mesh-llm/main/install.sh | bash
```

Then start a node:

```bash
mesh-llm serve --auto
```

That command:
- picks a suitable bundled backend for your machine
- downloads a model if needed
- joins the best public mesh
- exposes an OpenAI-compatible API at `http://localhost:9337/v1`
- starts the web console at `http://localhost:3131`

Inspect local GPU identity:

```bash
mesh-llm gpus
```
Check what is available:

```bash
curl -s http://localhost:9337/v1/models | jq '.data[].id'
```

Send a request:

```bash
curl http://localhost:9337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"GLM-4.7-Flash-Q4_K_M","messages":[{"role":"user","content":"hello"}]}'
```

```bash
mesh-llm serve --auto
```

This is the easiest way to see the system working end to end.
```bash
mesh-llm serve --model Qwen2.5-32B
```

This starts serving a model, opens the local API and console, and prints an invite token for other machines.
```bash
git clone https://github.com/michaelneale/mesh-llm
cd mesh-llm
just build
```

Requires: `just`, `cmake`, a Rust toolchain, and Node.js 24 + npm. NVIDIA GPU builds need `nvcc` (CUDA toolkit). AMD GPU builds need ROCm/HIP. Vulkan GPU builds need the Vulkan development files plus `glslc`. CPU-only and Jetson/Tegra builds also work. For source builds, `just build` auto-detects CUDA vs ROCm vs Vulkan on Linux, or you can force `backend=rocm` or `backend=vulkan`. See CONTRIBUTING.md for details.

Windows source builds are also supported for CUDA, ROCm/HIP, Vulkan, and CPU via `just build`. Metal remains macOS-only. Tagged GitHub releases now publish Windows .zip bundles for cpu, cuda, rocm, and vulkan, and you can generate the same artifacts locally with `just release-build-windows`, `just release-build-cuda-windows`, `just release-build-amd-windows`, `just release-build-vulkan-windows`, and the matching `release-bundle-*-windows` recipes.
Once installed, you can run:

```bash
mesh-llm serve --auto   # join the best public mesh, start serving
```

That's it. It downloads a model for your hardware, connects to other nodes, and gives you an OpenAI-compatible API at http://localhost:9337.

Or start your own:

```bash
mesh-llm serve --model Qwen2.5-32B   # downloads model (~20GB), starts API + web console
mesh-llm serve --model Qwen2.5-3B    # or a small model first (~2GB)
```

Add another machine:

```bash
mesh-llm serve --join <token>   # token printed by the first machine
```

Or discover and join public meshes:

```bash
mesh-llm serve --auto    # find and join the best mesh
mesh-llm client --auto   # join as API-only client (no GPU)
```

Every node gets an OpenAI-compatible API at http://localhost:9337/v1. Distribution is automatic — you just say `mesh-llm serve --model X` and the mesh figures out the best strategy:
- Model fits on one machine? → runs solo, full speed, no network overhead
- Dense model too big? → pipeline parallelism — layers split across nodes
- MoE model too big? → expert parallelism — experts split across nodes, zero cross-node traffic
If a node has enough VRAM, it always runs the full model; splitting only happens when it has to. Mesh LLM currently uses a lightly forked llama.cpp (see the Justfile for the branch it pulls from).
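The decision above can be sketched as a tiny placement function. This is an illustrative toy, not mesh-llm's internals: the function name and inputs are invented for the example.

```python
# Toy sketch of the placement decision: solo if any node fits the model,
# otherwise expert-parallel for MoE and pipeline-parallel for dense.
def plan(model_vram_gb: float, is_moe: bool, node_vram_gb: list[float]) -> str:
    # If any single node can hold the whole model, run it there at full speed.
    if any(v >= model_vram_gb for v in node_vram_gb):
        return "solo"
    # Otherwise split the model: experts for MoE, layers for dense.
    return "expert-parallel" if is_moe else "pipeline-parallel"

print(plan(20, False, [24, 8]))       # fits on the 24 GB node → solo
print(plan(80, False, [24, 24, 24]))  # dense, too big → pipeline-parallel
print(plan(120, True, [24, 24, 24]))  # MoE, too big → expert-parallel
```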
Pipeline parallelism — for dense models that don't fit on one machine, layers are distributed across nodes proportional to VRAM. llama-server runs on the highest-VRAM node and coordinates via RPC. Each rpc-server loads only its assigned layers from local disk. Latency-aware: peers are selected by lowest RTT first, with an 80ms hard cap — high-latency nodes stay in the mesh as API clients but don't participate in splits.
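A rough sketch of the split logic just described, assuming the simplest possible policy: drop peers over the 80ms RTT cap, then hand out layers proportional to VRAM. Function and data shapes are invented for illustration.

```python
# Illustrative layer split: filter peers by RTT, then divide layers
# proportionally to VRAM, giving any rounding remainder to the largest node.
def split_layers(n_layers: int, peers: dict[str, tuple[float, float]],
                 rtt_cap_ms: float = 80.0) -> dict[str, int]:
    # peers maps name -> (vram_gb, rtt_ms); high-latency peers are excluded
    eligible = {n: v for n, (v, rtt) in peers.items() if rtt <= rtt_cap_ms}
    total_vram = sum(eligible.values())
    shares = {n: int(n_layers * v / total_vram) for n, v in eligible.items()}
    biggest = max(eligible, key=eligible.get)
    shares[biggest] += n_layers - sum(shares.values())  # assign the remainder
    return shares

peers = {"a": (24.0, 5.0), "b": (12.0, 20.0), "c": (24.0, 200.0)}  # c too slow
print(split_layers(36, peers))  # c excluded; a gets 24 layers, b gets 12
```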
MoE expert parallelism — Mixture-of-Experts models (Qwen3-MoE, GLM, OLMoE, Mixtral, DeepSeek — increasingly the best-performing architectures) are auto-detected from the GGUF header. The mesh reads expert routing statistics to identify which experts matter most, then assigns each node an overlapping shard: a shared core of critical experts replicated everywhere, plus unique experts distributed across nodes. Each node gets a standalone GGUF with the full trunk + its expert subset and runs its own independent llama-server — zero cross-node traffic during inference. Sessions are hash-routed to nodes for KV cache locality.
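The overlapping-shard idea can be sketched in a few lines. The "heat" scores stand in for the expert routing statistics mentioned above; the partitioning policy here (top-k core, round-robin remainder) is a simplification for illustration.

```python
# Hedged sketch of overlapping expert shards: the hottest experts form a
# shared core replicated to every node; the rest are spread round-robin.
def shard_experts(heat: dict[int, float], nodes: list[str], core_size: int):
    ranked = sorted(heat, key=heat.get, reverse=True)
    core, rest = ranked[:core_size], ranked[core_size:]
    shards = {n: set(core) for n in nodes}      # core replicated everywhere
    for i, expert in enumerate(rest):           # unique experts distributed
        shards[nodes[i % len(nodes)]].add(expert)
    return shards

heat = {0: 9.0, 1: 7.5, 2: 0.4, 3: 0.3, 4: 0.2, 5: 0.1}
print(shard_experts(heat, ["node-a", "node-b"], core_size=2))
```

Each node's shard is self-sufficient for the experts it holds, which is what lets every node run an independent llama-server with no cross-node traffic.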
Multi-model — different nodes serve different models simultaneously. The API proxy peeks at the model field in each request and routes to the right node via QUIC tunnel. /v1/models lists everything available.
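The proxy's "peek at the model field" step amounts to parsing the request body and looking up a node. This minimal sketch invents the routing table; the real proxy tunnels over QUIC rather than returning a name.

```python
# Minimal sketch of routing by the `model` field in the request body.
import json

ROUTES = {"GLM-4.7-Flash-Q4_K_M": "node-a", "Qwen2.5-32B": "node-b"}

def route(raw_body: bytes) -> str:
    model = json.loads(raw_body)["model"]  # peek at the model field
    return ROUTES[model]                   # then forward to that node

body = b'{"model":"GLM-4.7-Flash-Q4_K_M","messages":[]}'
print(route(body))  # node-a
```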
Demand-aware rebalancing — a unified demand map tracks which models the mesh wants (from --model flags, API requests, and gossip). Demand signals propagate infectiously across all nodes and decay naturally via TTL. Standby nodes auto-promote to serve unserved models with active demand, or rebalance when one model is significantly hotter than others. When a model loses its last server, standby nodes detect it within ~60s.
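A toy version of the demand map shows the TTL mechanic: any demand signal refreshes a model's entry, and entries that are not refreshed decay away. Class and method names are invented for the sketch.

```python
# Toy demand map with TTL decay: signals refresh entries, stale entries expire.
class DemandMap:
    def __init__(self, ttl_s: float = 60.0):
        self.ttl = ttl_s
        self.seen: dict[str, float] = {}   # model -> last-seen timestamp

    def signal(self, model: str, now: float):
        self.seen[model] = now             # any --model flag, request, or gossip

    def wanted(self, now: float) -> set[str]:
        # entries older than the TTL have decayed
        return {m for m, t in self.seen.items() if now - t <= self.ttl}

d = DemandMap(ttl_s=60)
d.signal("Qwen2.5-32B", now=0)
d.signal("GLM-4.7-Flash", now=50)
print(d.wanted(now=55))   # both models still wanted
print(d.wanted(now=90))   # the Qwen entry has decayed
```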
Latency design — the key insight is that HTTP streaming is latency-tolerant while RPC is latency-multiplied. llama-server always runs on the same box as the GPU. The mesh tunnels HTTP, so cross-network latency only affects time-to-first-token, not per-token throughput. RPC only crosses the network for pipeline splits where the model physically doesn't fit on one machine.
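The "latency-tolerant vs latency-multiplied" distinction is easy to see with back-of-envelope arithmetic: tunneled HTTP pays the RTT roughly once, while naive RPC pays it on every round-trip of every token. The numbers below are illustrative only (the 8 round-trips per token matches the optimized figure quoted later).

```python
# Back-of-envelope latency model for streamed generation over the network.
def http_total_ms(rtt_ms, n_tokens, ms_per_token):
    return rtt_ms + n_tokens * ms_per_token   # RTT paid once, then stream

def rpc_total_ms(rtt_ms, n_tokens, ms_per_token, round_trips_per_token):
    # every token pays RTT for each of its cross-network round-trips
    return n_tokens * (ms_per_token + rtt_ms * round_trips_per_token)

print(http_total_ms(80, 100, 20))    # 80 + 100*20 = 2080 ms
print(rpc_total_ms(80, 100, 20, 8))  # 100 * (20 + 80*8) = 66000 ms
```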
- Zero-transfer GGUF loading — `SET_TENSOR_GGUF` tells rpc-server to read weights from local disk. Dropped model load from 111s → 5s.
- RPC round-trip reduction — cached `get_alloc_size`, skip GGUF lookups for intermediates. Per-token round-trips: 558 → 8.
- Direct server-to-server transfers — intermediate tensors pushed directly between rpc-servers via TCP, not relayed through the client.
- Speculative decoding — draft model runs locally on the host, proposes tokens verified in one batched forward pass. +38% throughput on code (75% acceptance).
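The core of speculative decoding's draft-and-verify loop can be reduced to prefix acceptance: the draft proposes several tokens, the target model checks them in one batched pass, and the matching prefix is accepted. This toy substitutes token lists for real decoders and greedy matching for the full acceptance rule.

```python
# Toy of the verify step: accept the longest prefix where the draft's
# proposals match what the target model would have produced.
def accept_prefix(draft_tokens, target_tokens):
    n = 0
    while n < len(draft_tokens) and draft_tokens[n] == target_tokens[n]:
        n += 1
    return draft_tokens[:n]   # accepted for free; the rest is re-decoded

print(accept_prefix(["def", "main", "(", ")"], ["def", "main", "(", ":"]))
```

High acceptance rates (like the 75% quoted above for code) mean most draft tokens survive verification, which is where the throughput gain comes from.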
```bash
mesh-llm serve --model Qwen2.5-32B
```

Starts serving a model and prints an invite token. This mesh is private — only people you share the token with can join.

To make it public (discoverable by others via --auto):

```bash
mesh-llm serve --model Qwen2.5-32B --publish
```

```bash
mesh-llm serve --join <token>    # join with invite token (GPU node)
mesh-llm client --join <token>   # join as API-only client (no GPU)
```

```bash
mesh-llm serve --auto --model GLM-4.7-Flash-Q4_K_M --mesh-name "poker-night"
```

Everyone runs the same command. The first person creates it; everyone else discovers "poker-night" and joins automatically. `--mesh-name` implies `--publish` — named meshes are always published to the directory.
```bash
mesh-llm serve --auto    # discover, join, and serve a model
mesh-llm client --auto   # join as API-only client (no GPU)
mesh-llm discover        # browse available meshes
mesh-llm gpus            # inspect local GPUs and stable IDs
```

```bash
mesh-llm serve --model Qwen2.5-32B --model GLM-4.7-Flash

# Route by model name
curl localhost:9337/v1/chat/completions -d '{"model":"GLM-4.7-Flash-Q4_K_M", ...}'
```

Different nodes serve different models. The API proxy routes by the model field.
```bash
mesh-llm gpus
```

Prints local GPU entries, backend device names, stable IDs, VRAM, and cached bandwidth if a benchmark fingerprint is already available.
`mesh-llm serve` can now load startup models from `~/.mesh-llm/config.toml`:

```toml
version = 1

[gpu]
assignment = "auto"

[[models]]
model = "Qwen3-8B-Q4_K_M"

[[models]]
model = "bartowski/Qwen2.5-VL-7B-Instruct-GGUF/qwen2.5-vl-7b-instruct-q4_k_m.gguf"
mmproj = "bartowski/Qwen2.5-VL-7B-Instruct-GGUF/mmproj-f16.gguf"
ctx_size = 8192

[[plugin]]
name = "blackboard"
enabled = true
```

Start with the default config path:

```bash
mesh-llm serve
```

If no startup models are configured, `mesh-llm serve` prints a ⚠️ warning, shows help, and exits.
Or point at a different file:

```bash
mesh-llm serve --config /path/to/config.toml
```

Precedence rules:
- Explicit `--model` or `--gguf` ignores configured `[[models]]` entries.
- Explicit `--ctx-size` overrides the configured `ctx_size` for the selected startup models.
- Plugin entries still live in the same file.
```bash
mesh-llm   # no args — prints --help and exits
```

This does not start the console or bind any ports. Use the CLI flags shown in --help to start or join a mesh.
To install it as a per-user background service:

```bash
curl -fsSL https://raw.githubusercontent.com/michaelneale/mesh-llm/main/install.sh | bash -s -- --service
```

Service installs are user-scoped:
- macOS installs a `launchd` agent at `~/Library/LaunchAgents/com.mesh-llm.mesh-llm.plist`
- Linux installs a `systemd --user` unit at `~/.config/systemd/user/mesh-llm.service`
- Shared environment config lives in `~/.config/mesh-llm/service.env`
- Startup models live in `~/.mesh-llm/config.toml`
The two platforms handle launch startup the same way — both boil down to running `mesh-llm serve`:
- macOS: `launchd` runs `~/.config/mesh-llm/run-service.sh`, which loads `service.env` and executes `mesh-llm serve`.
- Linux: the installer writes `mesh-llm serve` directly into `ExecStart=` in `~/.config/systemd/user/mesh-llm.service`.
The background service no longer stores custom startup args. Configure startup models in ~/.mesh-llm/config.toml instead.
`service.env` is optional and shared by both platforms. Use plain KEY=value lines, for example:

```
MESH_LLM_NO_SELF_UPDATE=1
```
If you edit the Linux unit manually, reload and restart it:
```bash
systemctl --user daemon-reload
systemctl --user restart mesh-llm.service
```

On Linux this is a user service, so if you want it to keep running after a reboot and before login, enable lingering once:

```bash
sudo loginctl enable-linger "$USER"
```

```bash
mesh-llm serve --model Qwen2.5-32B   # dashboard at http://localhost:3131
```

Live topology, VRAM bars per node, model picker, and built-in chat. Everything comes from /api/status (JSON) and /api/events (SSE).
mesh-llm supports multimodal requests on:
- `POST /v1/chat/completions`
- `POST /v1/responses`
The console supports image, audio, and file attachments. Large attachments use request-scoped blob upload rather than permanent storage.
| Family / model type | Vision | Audio | Notes |
|---|---|---|---|
| Qwen3-VL, Qwen3VL | yes | no | Example: Qwen3VL-2B-Instruct-Q4_K_M |
| Qwen2-VL, Qwen2.5-VL | yes | no | Vision-capable Qwen VL families |
| LLaVA, mllama, PaliGemma, Idefics, Molmo, InternVL, GLM-4V, Ovis, Florence | yes | no | Detected as vision-capable families |
| Qwen2-Audio | no | yes | Audio-capable family |
| SeaLLM-Audio | no | yes | Audio-capable family |
| Ultravox | no | yes | Audio-capable family |
| Omni | no or metadata-dependent | yes | Example: Qwen2.5-Omni-3B-Q4_K_M |
| Whisper | no | yes | Audio-capable family |
| Any GGUF with mmproj sidecar | yes | depends | Strong local signal for vision support |
| Any model with vision_config / vision token IDs | yes | depends | Promoted by metadata |
| Any model with audio_config / audio token IDs | depends | yes | Promoted by metadata |
| Generic multimodal, -vl, image, video, voice naming only | likely | likely | Hint only, not a strong routing guarantee |
Notes:
- `yes` means mesh-llm treats the model as runtime-capable for routing and UI.
- `likely` means mesh-llm shows a weaker hint but does not rely on it as a hard capability.
- Mixed image+audio requests work only when the selected model/runtime actually supports both modalities.
- Non-goals: `POST /v1/audio/transcriptions`, `POST /v1/audio/speech`, and `/v1/realtime`.
For the full capability and transport details, see mesh-llm/docs/MULTI_MODAL.md.
Build-from-source and UI development instructions are in CONTRIBUTING.md.
mesh-llm exposes an OpenAI-compatible API on localhost:9337. Any tool that supports custom OpenAI endpoints works. /v1/models lists available models; the model field in requests routes to the right node.
For built-in launcher integrations (goose, claude):
- If a mesh is already running locally on `--port`, it is reused.
- If not, `mesh-llm` auto-starts a background client node that auto-joins the mesh.
- If `--model` is omitted, the launcher picks the strongest tool-capable model available on the mesh.
- When the harness exits (e.g. `claude` quits), the auto-started node is cleaned up automatically.
Goose is available as both CLI (goose session) and desktop app (Goose.app).
```bash
mesh-llm goose
```

Use a specific model (example: MiniMax):

```bash
mesh-llm goose --model MiniMax-M2.5-Q4_K_M
```

This command writes/updates ~/.config/goose/custom_providers/mesh.json and launches Goose.

- Start a mesh client:

  ```bash
  mesh-llm client --auto --port 9337
  ```

- Check what models are available:

  ```bash
  curl -s http://localhost:9337/v1/models | jq '.data[].id'
  ```

mesh-llm ships a built-in lemonade plugin that registers a local Lemonade Server as another OpenAI-compatible backend. For setup and verification steps, see docs/USAGE.md.
If you want the mesh to be discoverable via --auto, publish it:
```bash
mesh-llm serve --model Qwen2.5-32B --publish
```

```bash
mesh-llm serve --join <token>
```

Use `mesh-llm client` if the machine should join without serving a model:

```bash
mesh-llm client --join <token>
```

```bash
mesh-llm serve --auto --model GLM-4.7-Flash-Q4_K_M --mesh-name "poker-night"
```

Everyone runs the same command. The first node creates the mesh; the rest discover and join it automatically.
```bash
mesh-llm serve --model Qwen2.5-32B --model GLM-4.7-Flash
```

Requests are routed by the model field:

```bash
curl localhost:9337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"GLM-4.7-Flash-Q4_K_M","messages":[{"role":"user","content":"hello"}]}'
```

Mesh LLM keeps the user-facing surface simple: talk to localhost:9337, pick a model, and let the mesh decide how to serve it.
- If a model fits on one machine, it runs there with no network overhead.
- If a dense model does not fit, layers are split across low-latency peers.
- If an MoE model does not fit, experts are split across nodes and requests are hash-routed for cache locality.
- Different nodes can serve different models at the same time.
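The hash-routing mentioned above can be sketched with a stable hash of the session id, so the same session always lands on the same node and reuses its warm KV cache. The function name is invented for the example; the real mesh's routing details may differ.

```python
# Sketch of sticky session routing: hash the session id to pick a node.
import hashlib

def node_for(session_id: str, nodes: list[str]) -> str:
    digest = hashlib.sha256(session_id.encode()).digest()
    # same session id -> same node, so its KV cache stays warm
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

nodes = ["node-a", "node-b", "node-c"]
assert node_for("session-42", nodes) == node_for("session-42", nodes)  # sticky
```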
Each node also exposes a management API and web console on port 3131.
The installer currently targets macOS and Linux release bundles. Windows coming soon.
To force a specific bundled flavor during install:
```bash
curl -fsSL https://raw.githubusercontent.com/michaelneale/mesh-llm/main/install.sh | MESH_LLM_INSTALL_FLAVOR=vulkan bash
```

Installed release bundles use flavor-specific llama.cpp binaries:
- macOS: `metal`
- Linux: `cpu`, `cuda`, `rocm`, `vulkan`

To update a bundle install to the latest release:

```bash
mesh-llm update
```

If you build from source, always use `just`:
```bash
git clone https://github.com/michaelneale/mesh-llm
cd mesh-llm
just build
```

Requirements and backend-specific build notes are in CONTRIBUTING.md.
When a node is running, open:
http://localhost:3131
The console shows live topology, VRAM usage, loaded models, and built-in chat. It is backed by /api/status and /api/events.
You can also try the hosted demo:
- docs/USAGE.md for service installs, model commands, storage, and runtime control
- docs/AGENTS.md for Goose, Claude Code, pi, OpenCode, curl, and blackboard usage
- docs/BENCHMARKS.md for benchmark numbers and context
- CONTRIBUTING.md for local development and build workflows
- PLUGINS.md for the plugin system and blackboard internals
- mesh-llm/README.md for Rust crate structure
- ROADMAP.md for future work
Join the #mesh-llm channel on the Goose Discord for discussion and support.
