Distributed LLM Inference for Apple Silicon Clusters
RUN BIG MODELS | RUN LONG CONTEXT | MAXIMIZE UTILIZATION
dnet runs LLMs across Apple Silicon devices. Modular execution strategies, automatic device profiling, drop-in OpenAI API.
Execution
- No Memory Ceiling: Run models that exceed total cluster memory; compute/I/O overlap keeps data flowing (see the sketch after this list)
- UMA-Specific: Designed around Apple Silicon's unified memory architecture for efficient layer swapping
- OpenAI-Compatible: Drop-in `/v1/chat/completions` endpoint
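The "No Memory Ceiling" point works by streaming upcoming layer weights from disk while earlier layers are still computing, so no device ever has to hold the whole model at once. The sketch below is a conceptual illustration of that compute/I/O overlap only, not dnet's actual implementation; `load_layer_weights` and `run_layer` are hypothetical stand-ins.

```python
import threading
from queue import Queue

def load_layer_weights(layer_idx):
    """Hypothetical stand-in: read one layer's weights from disk into unified memory."""
    ...

def run_layer(layer_idx, weights, activations):
    """Hypothetical stand-in: run one transformer layer on the current activations."""
    ...

def pipelined_forward(num_layers, activations, prefetch_depth=2):
    """Overlap disk I/O with compute: prefetch upcoming layers while earlier ones run."""
    ready = Queue(maxsize=prefetch_depth)

    def prefetcher():
        for i in range(num_layers):
            ready.put((i, load_layer_weights(i)))  # blocks when the prefetch buffer is full

    threading.Thread(target=prefetcher, daemon=True).start()

    for _ in range(num_layers):
        layer_idx, weights = ready.get()          # compute waits only if I/O falls behind
        activations = run_layer(layer_idx, weights, activations)
        del weights                               # free the slot so memory use stays bounded
    return activations
```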
Cluster Management
- Automatic Discovery: Nodes find each other; no manual topology configuration
- Thunderbolt Detection: Automatically utilizes Thunderbolt for high-bandwidth inter-device communication
Workload Assignment
- Device Profiling: Measures FLOPs, memory, and inter-device latency per node
- Model Profiling: Analyzes compute and memory requirements per layer
- Heterogeneity-Aware Solver: Topology-aware assignment that accounts for device capability, network speed, KV cache size, and disk speed (a simplified illustration follows below)
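To make the assignment idea concrete, here is a deliberately simplified, capability-proportional split. This is not dnet's solver (the real assignment is computed by distilp and also weighs network speed, KV cache size, and disk speed); the device names and scores are made up.

```python
def split_layers_by_capability(num_layers: int, devices: dict[str, float]) -> dict[str, list[int]]:
    """Toy illustration: give each device a contiguous layer range roughly
    proportional to its capability score (e.g. TFLOPS x available memory)."""
    total = sum(devices.values())
    items = list(devices.items())
    assignment, start = {}, 0
    for i, (name, score) in enumerate(items):
        if i == len(items) - 1:
            end = num_layers  # last device picks up whatever remains
        else:
            share = max(1, round(num_layers * score / total))
            end = min(start + share, num_layers - (len(items) - 1 - i))
        assignment[name] = list(range(start, end))
        start = end
    return assignment

# Hypothetical 3-device cluster sharing a 32-layer model:
print(split_layers_by_capability(32, {"studio": 6.0, "macbook": 3.0, "mini": 1.0}))
# studio gets layers 0-18, macbook 19-28, mini 29-31
```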
- ✅ Pipelined-ring - Run >32B 8-bit models across devices with insufficient total memory
- 🚧 Long context - Make >128K context windows a reality for home clusters
- 🚧 High throughput - Maximize throughput via tensor parallelism
- 🚧 Unified backend - A single optimized backend for Apple Silicon, NVIDIA, and AMD (currently Apple Silicon only, via MLX)
dnet requires several submodules, which can all be cloned with the following command:
```sh
git clone --recurse-submodules https://github.com/firstbatchxyz/dnet.git
```
dnet uses uv, so make sure it is installed. You can check for uv with the command below, and follow the installation guide if you do not have it.
```sh
uv --version
```
dnet currently only supports MLX on Apple Silicon. To install, run:
```sh
uv sync --extra mac
```
After syncing dependencies, run the one-time setup to install Git hooks and generate protos:
```sh
make init
```
This will:
- Install pre-commit hooks for automatic code quality checks
- Generate protobuf files
The pre-commit hooks will automatically run ruff formatting, ruff linting, and mypy type checking before each commit.
This project uses pre-commit to ensure code quality. Hooks are installed automatically when you run make init, but you can also manage them manually:
```sh
# Install hooks
make hooks-install
# Run all hooks on all files
make hooks-run
# Update hook versions
make hooks-update
```
The hooks will run automatically on git commit, checking:
- Code formatting (ruff format)
- Linting (ruff check)
- Type checking (mypy)
dnet uses a dynamic topology approach where nodes start without models, then the API discovers devices and distributes layers optimally using distilp.
- Start Shards: Launch shard nodes on each device.
- Start API: Launch the API node; one of the shards SHOULD reside on the same device.
- Prepare Topology: API discovers devices and solves for optimal layer distribution.
- Load Model: API instructs shards to load their assigned layers.
- Inference: Use the `/v1/chat/completions` endpoint for generation.
See catalog for supported models.
dnet comes with a TUI built in Rust, providing a neat interface for you to load models, view the topology and chat with the loaded models.
Install the TUI with:
```sh
cargo install --git https://github.com/firstbatchxyz/dnet-tui.git
```
Then simply run with:
```sh
dnet-tui
```
For more details, check out the repository.
Start a shard node with gRPC and HTTP ports:
```sh
uv run dnet-shard --http-port 8081 --grpc-port 58081
```
Each shard should be started on a different device with a different port (try incrementing by one for each shard), like the following:
```sh
uv run dnet-shard --http-port 8082 --grpc-port 58082
```
You can optionally specify a custom shard name for better identification in discovery, TUI, and logs:
```sh
uv run dnet-shard --http-port 8081 --grpc-port 58081 --shard-name my-shard-1
```
Warning
Each shard name must be unique within the same network. Using duplicate shard names will cause discovery conflicts and connectivity issues.
Start the API node:
```sh
uv run dnet-api --http-port 8080 --grpc-port 58080
```
To do inference, we must first prepare the topology (discover nodes) and then load the model itself. After that, we can call the completions endpoint as usual.
Tip
We have a script that can prepare the model and load it at once:
```sh
uv run ./scripts/prepare_model.py Qwen/Qwen3-4B-MLX-4bit
```
Discover devices and compute optimal layer distribution:
```sh
curl -X POST http://localhost:8080/v1/prepare_topology \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-4B-MLX-4bit"
}'
```
The response will be the optimal topology (as given by the solver) for the discovered devices.
Note
Once the topology is prepared, you can later fetch it via the `/topology` endpoint:
```sh
curl http://localhost:8080/v1/topology \
-H "Content-Type: application/json" \Load the model on shards with prepared topology:
curl -X POST http://localhost:8080/v1/load_model \
-H "Content-Type: application/json" \
-d $OUTPUT_FROM_PREPARE_TOPOLOGY
```
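If you prefer to script these two calls instead of pasting the JSON by hand, a minimal Python sketch follows. It assumes the `requests` package is installed and that the `prepare_topology` response body is exactly the payload `load_model` expects, as the placeholder above implies; adjust the port and model name to your setup.

```python
import requests  # pip install requests (or adapt to your HTTP client of choice)

API = "http://localhost:8080"
MODEL = "Qwen/Qwen3-4B-MLX-4bit"

# Discover devices and solve for a layer distribution.
topology = requests.post(f"{API}/v1/prepare_topology", json={"model": MODEL}, timeout=300)
topology.raise_for_status()

# Feed the solver's output straight into load_model.
loaded = requests.post(f"{API}/v1/load_model", json=topology.json(), timeout=600)
loaded.raise_for_status()
print("load_model status:", loaded.status_code)
```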
Generate text using the loaded model:
```sh
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen2.5-0.5B-Instruct-4bit",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
],
"max_tokens": 100
}'
```
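Since the endpoint is OpenAI-compatible, existing OpenAI clients can also be pointed at the API node. The sketch below uses the official `openai` Python package (installed separately); the API key is a placeholder (the client library requires one to be set), the base URL assumes the API node started above, and the model name should match whatever you loaded.

```python
from openai import OpenAI  # pip install openai

# Point the standard OpenAI client at the dnet API node.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="dnet")  # placeholder key

response = client.chat.completions.create(
    model="Qwen/Qwen3-4B-MLX-4bit",  # the model loaded in the steps above
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```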
You can get the list of discoverable devices with:
```sh
curl http://localhost:8080/v1/devices \
-H "Content-Type: application/json"dnet supports configuration via a .env file in the project root. This allows you to set environment variables for logging, profiling, and other runtime options without modifying code or command-line arguments.
# Set logging level (e.g., DEBUG, INFO, WARNING, ERROR)
LOG_LEVEL=INFO
# Enable profiling (set to 1 to enable)
PROFILE=0
# Add other environment variables as needed
```
Please see .env.example for a complete example. The .env file is automatically loaded when running via uv run or using the provided Makefile targets. This ensures consistent configuration in both local development and CI environments.
For more details, see the relevant sections in the Makefile and CI workflow.
Before testing, make sure to install the dev extras:
```sh
uv sync --extra dev --extra mac
```
You can run Pytest tests via:
```sh
uv run pytest -v
```
For code quality checks (linting, formatting, type checking), see the Development section above.
Tip
If you are using VS Code, we have prepared tasks that you can run easily from the Command Palette > Tasks: Run Task.
dnet is built on top of MLX and inspired by pioneering work in distributed inference:
Prima.cpp: Fast 30-70B LLM Inference on Heterogeneous and Low-Resource Home Clusters
Exo: Run your own AI cluster at home with everyday devices
Petals: Collaborative Inference for Large Language Models
You can find the license here.
If you have used this work, please feel free to cite us!

