
dnet

Distributed LLM Inference for Apple Silicon Clusters

License: Apache-2.0 | Workflow: Tests

RUN BIG MODELS | RUN LONG CONTEXT | MAXIMIZE UTILIZATION

dnet runs LLMs across Apple Silicon devices. Modular execution strategies, automatic device profiling, drop-in OpenAI API.

Features

  • Execution

    • No Memory Ceiling: Run models that exceed total cluster memory; compute/I/O overlap keeps data flowing
    • UMA-Specific: Designed around Apple Silicon's unified memory to enable efficient layer swapping
    • OpenAI-Compatible: Drop-in /v1/chat/completions endpoint
  • Cluster Management

    • Automatic Discovery: Nodes find each other; no manual topology configuration
    • Thunderbolt Detection: Automatically utilizes Thunderbolt for high-bandwidth inter-device communication
  • Workload Assignment

    • Device Profiling: Measures FLOPs, memory, and inter-device latency per node
    • Model Profiling: Analyzes compute and memory requirements per layer
    • Heterogeneity-Aware Solver: Topology-aware assignment that accounts for device capability, network speed, KV cache size, and disk speed
  • Pipelined-ring - Run >32B 8-bit models across devices with insufficient total memory

  • 🚧 Long context - Make >128K context windows a reality for home clusters

  • 🚧 High throughput - Maximize throughput via tensor parallelism

  • 🚧 Unified backend - A single optimized backend for Apple Silicon, NVIDIA, and AMD (currently Apple Silicon only, via MLX)

Installation

dnet requires several submodules, which can all be cloned with the following command:

git clone --recurse-submodules https://github.com/firstbatchxyz/dnet.git

dnet uses uv, so make sure it is installed. You can check for uv with the command below, and follow the installation guide if you do not have it.

uv --version
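
If uv is not installed, its standalone installer is one option (see the uv documentation for alternatives):

curl -LsSf https://astral.sh/uv/install.sh | sh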

dnet currently only supports MLX on Apple Silicon. To install, run:

uv sync --extra mac

After syncing dependencies, run the one-time setup to install Git hooks and generate protos:

make init

This will:

  • Install pre-commit hooks for automatic code quality checks
  • Generate protobuf files

The pre-commit hooks will automatically run ruff formatting, ruff linting, and mypy type checking before each commit.

Development

Git Hooks

This project uses pre-commit to ensure code quality. Hooks are installed automatically when you run make init, but you can also manage them manually:

# Install hooks
make hooks-install

# Run all hooks on all files
make hooks-run

# Update hook versions
make hooks-update

The hooks will run automatically on git commit, checking:

  • Code formatting (ruff format)
  • Linting (ruff check)
  • Type checking (mypy)
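
If you prefer to invoke pre-commit directly rather than through make (the targets above presumably wrap commands like these, assuming pre-commit is available in the project environment), the standard invocations are:

# Install hooks into .git/hooks
uv run pre-commit install

# Run all hooks against the whole tree
uv run pre-commit run --all-files

# Bump pinned hook versions
uv run pre-commit autoupdate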

Usage

dnet uses a dynamic-topology approach: nodes start without models, then the API discovers devices and distributes layers optimally using distilp.

  1. Start Shards: Launch shard nodes on each device.
  2. Start API: Launch the API node; one of the shards should reside on the same device.
  3. Prepare Topology: API discovers devices and solves for optimal layer distribution.
  4. Load Model: API instructs shards to load their assigned layers.
  5. Inference: Use /v1/chat/completions endpoint for generation.

See catalog for supported models.
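
As a rough end-to-end sketch using the commands from the sections below (the ports and the two-device split are illustrative):

# On device A (also runs the API)
uv run dnet-shard --http-port 8081 --grpc-port 58081
uv run dnet-api --http-port 8080 --grpc-port 58080

# On device B
uv run dnet-shard --http-port 8082 --grpc-port 58082

# From device A: prepare the topology and load the model in one step,
# then use the /v1/chat/completions endpoint as shown below
uv run ./scripts/prepare_model.py Qwen/Qwen3-4B-MLX-4bit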


Viewing dnet TUI

dnet comes with a TUI built in Rust, providing a neat interface to load models, view the topology, and chat with the loaded models.

Install the TUI with:

cargo install --git https://github.com/firstbatchxyz/dnet-tui.git

Then simply run with:

dnet-tui

For more details, check out the repository.

Running a Shard

Start a shard node with gRPC and HTTP ports:

uv run dnet-shard --http-port 8081 --grpc-port 58081

Each shard should be started on a different device with a different port (try incrementing by one for each shard), like the following:

uv run dnet-shard --http-port 8082 --grpc-port 58082

You can optionally specify a custom shard name for better identification in discovery, TUI, and logs:

uv run dnet-shard --http-port 8081 --grpc-port 58081 --shard-name my-shard-1

Warning

Each shard name must be unique within the same network. Using duplicate shard names will cause discovery conflicts and connectivity issues.
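
One way to keep names unique across a cluster is to derive them from each machine's hostname, for example:

uv run dnet-shard --http-port 8081 --grpc-port 58081 --shard-name "$(hostname)-shard"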

Running an API

Start the API node:

uv run dnet-api --http-port 8080 --grpc-port 58080

To run inference, we must first prepare the topology (discover nodes) and then load the model itself. After that, we can call the completions endpoint as usual.

Tip

We have a script that can prepare the topology and load the model in one step:

uv run ./scripts/prepare_model.py Qwen/Qwen3-4B-MLX-4bit

Prepare Topology

Discover devices and compute optimal layer distribution:

curl -X POST http://localhost:8080/v1/prepare_topology \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-4B-MLX-4bit"
  }'

The response will be the optimal topology (as given by the solver) for the discovered devices.

Note

Once the topology is prepared, you can fetch it later via the /topology endpoint:

curl http://localhost:8080/v1/topology \
 -H "Content-Type: application/json"

Load Model

Load the model on shards with prepared topology:

curl -X POST http://localhost:8080/v1/load_model \
  -H "Content-Type: application/json" \
  -d $OUTPUT_FROM_PREPARE_TOPOLOGY
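
For example, you can capture the solver output from the prepare step in a shell variable and pass it through verbatim (ports and model as in the previous examples):

TOPOLOGY=$(curl -s -X POST http://localhost:8080/v1/prepare_topology \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-4B-MLX-4bit"}')

curl -X POST http://localhost:8080/v1/load_model \
  -H "Content-Type: application/json" \
  -d "$TOPOLOGY"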

(Screenshot: a shard with a loaded model)

Chat Completions

Generate text using the loaded model:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2.5-0.5B-Instruct-4bit",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 100
  }'
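
Because the endpoint is OpenAI-compatible, the official openai Python client should also work as a drop-in. A minimal sketch, assuming the local server does not validate the API key:

from openai import OpenAI

# Point the client at the dnet API node instead of api.openai.com
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen/Qwen3-4B-MLX-4bit",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=100,
)
print(response.choices[0].message.content)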

Devices

You can get the list of discoverable devices with:

curl http://localhost:8080/v1/devices \
  -H "Content-Type: application/json"

Configuration (.env)

dnet supports configuration via a .env file in the project root. This allows you to set environment variables for logging, profiling, and other runtime options without modifying code or command-line arguments.

Example .env

# Set logging level (e.g., DEBUG, INFO, WARNING, ERROR)
LOG_LEVEL=INFO

# Enable profiling (set to 1 to enable)
PROFILE=0

# Add other environment variables as needed

Please see .env.example for a complete example. The .env file is automatically loaded when running via uv run or using the provided Makefile targets. This ensures consistent configuration in both local development and CI environments.

For more details, see the relevant sections in the Makefile and CI workflow.

Testing

Before testing, make sure to install the dev extras:

uv sync --extra dev --extra mac

You can run Pytest tests via:

uv run pytest -v
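
To run a subset of tests, pytest's usual -k filter applies (the pattern below is illustrative):

uv run pytest -v -k topology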

For code quality checks (linting, formatting, type checking), see the Development section above.

Tip

If you are using VS Code, we have prepared tasks that you can run from the Command Palette via Tasks: Run Task.

Acknowledgements

dnet is built on top of MLX and inspired by pioneering work in distributed inference:

  • prima.cpp: Fast 30-70B LLM Inference on Heterogeneous and Low-Resource Home Clusters
  • exo: Run your own AI cluster at home with everyday devices
  • Petals: Collaborative Inference for Large Language Models

License

dnet is licensed under Apache-2.0; you can find the license here.

Cite

If you have used this work, please feel free to cite us!
