vram-optimizer is a Rust CLI that recommends GPU memory parameters for
single-GPU local LLM inference. It supports two engines:
- llamacpp: recommends gpu_layers and ctx_size
- vllm: recommends gpu_memory_utilization and max_model_len

The CLI is intended to run before an inference server starts. A wrapper should
pass VRAM totals from nvidia-smi on Linux. On macOS, a wrapper can bypass this
binary.

    cargo build --release

    vram-optimizer \
      --engine llamacpp \
      --model ./Qwen3-Coder-Next-Q4_K_M.gguf \
      --vram-total-mib 32607 \
      --vram-used-mib 163 \
      --safety-margin-mib 500 \
      --kv-cache-type f16 \
      --context-priority balanced

Output defaults to machine-readable key=value pairs:

    engine=llamacpp
    gpu_layers=31
    ctx_size=65536
    vram_model_mib=30840
    vram_kv_cache_mib=448
    vram_total_estimated_mib=31288
    vram_headroom_mib=656

Use --json for structured output.
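
A wrapper that consumes the default output only needs to split each line on the first `=`. The following is a minimal sketch, not part of vram-optimizer itself: parse_key_value_output and llama_server_args are hypothetical helper names, and the mapping to llama-server's --n-gpu-layers and --ctx-size flags is an assumption about how a wrapper might use the result.

```rust
use std::collections::HashMap;

/// Parse `key=value` lines into a map, skipping blank or malformed lines.
fn parse_key_value_output(output: &str) -> HashMap<String, String> {
    output
        .lines()
        .filter_map(|line| line.trim().split_once('='))
        .map(|(key, value)| (key.trim().to_string(), value.trim().to_string()))
        .collect()
}

/// Hypothetical wrapper step: turn a llamacpp recommendation into server
/// arguments. Returns None when either expected key is missing.
fn llama_server_args(values: &HashMap<String, String>) -> Option<Vec<String>> {
    let gpu_layers = values.get("gpu_layers")?;
    let ctx_size = values.get("ctx_size")?;
    Some(vec![
        "--n-gpu-layers".to_string(),
        gpu_layers.clone(),
        "--ctx-size".to_string(),
        ctx_size.clone(),
    ])
}
```

Returning None on a missing key lets the wrapper fall back to its own defaults instead of starting the server with a partial recommendation.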
For remote models, pass a Hub repo id as --model. The tool fetches
config.json and the model listing. Private or gated repos can use, in order:

- --hf-token <TOKEN>
- HF_TOKEN=<TOKEN>
- an encrypted cached token at $FLOX_ENV_CACHE/vram-optimizer/hf-token.enc
- interactive token entry when VRAM_OPTIMIZER_INTERACTIVE=1 is set

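
The precedence above amounts to "first usable source wins". As a rough sketch under stated assumptions: TokenSource and resolve_token are hypothetical names, each source is passed in as an already-read Option, and blank values are treated as absent for every source (the text only says so explicitly for interactive entry).

```rust
/// Where a token came from, listed in precedence order.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TokenSource {
    Flag,
    Env,
    Cache,
    Interactive,
}

/// Return the first nonblank token together with its source.
/// Assumption: blank strings count as "not provided" for all four sources.
fn resolve_token(
    flag: Option<&str>,
    env: Option<&str>,
    cached: Option<&str>,
    interactive: Option<&str>,
) -> Option<(TokenSource, String)> {
    let candidates = vec![
        (TokenSource::Flag, flag),
        (TokenSource::Env, env),
        (TokenSource::Cache, cached),
        (TokenSource::Interactive, interactive),
    ];
    candidates.into_iter().find_map(|(source, value)| {
        let value = value?.trim();
        if value.is_empty() {
            None
        } else {
            Some((source, value.to_string()))
        }
    })
}
```

Keeping the source alongside the token makes it easy for a caller to decide whether anything should be written to the cache, since only interactive entry is described as doing so.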
When interactive mode receives a nonblank token, the CLI encrypts it with
ChaCha20-Poly1305 and writes the encrypted payload to
$FLOX_ENV_CACHE/vram-optimizer/hf-token.enc. It also creates a cache-local
32-byte key at $FLOX_ENV_CACHE/vram-optimizer/hf-token.key. On Unix systems,
the cache directory is chmod 0700 and both files are chmod 0600.
If $FLOX_ENV_CACHE is not set, interactive nonblank token entry fails with a
clear diagnostic. Passing --hf-token or setting HF_TOKEN never writes token
material to disk.
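
For readers who want to see what the Unix permission handling could look like, here is a small std-only sketch. It is not the tool's actual code: write_private_file is a hypothetical helper, it is Unix-only, and it writes whatever bytes it is given (the ChaCha20-Poly1305 encryption step is out of scope here).

```rust
use std::fs::{self, DirBuilder, OpenOptions, Permissions};
use std::io::{self, Write};
use std::os::unix::fs::{DirBuilderExt, OpenOptionsExt, PermissionsExt};
use std::path::Path;

/// Create `dir` with mode 0700 and write `bytes` to `dir/name` with mode 0600.
fn write_private_file(dir: &Path, name: &str, bytes: &[u8]) -> io::Result<()> {
    // Request 0700 at creation time so the directory is never world-readable.
    DirBuilder::new().recursive(true).mode(0o700).create(dir)?;
    // Re-apply the mode in case the directory already existed with other bits.
    fs::set_permissions(dir, Permissions::from_mode(0o700))?;

    let path = dir.join(name);
    let mut file = OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(true)
        .mode(0o600) // only applies when the file is newly created
        .open(&path)?;
    file.write_all(bytes)?;
    // Tighten permissions on a pre-existing file as well.
    fs::set_permissions(&path, Permissions::from_mode(0o600))?;
    Ok(())
}
```

Setting the mode both at creation and afterwards covers the case where the directory or file was left behind by an earlier run with looser permissions.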
vram-optimizer never sums every file with a matching extension. It selects one
base model artifact set before estimating VRAM:

- local or remote GGUF inputs select one GGUF file, or one complete split-GGUF shard set such as name-00001-of-00003.gguf through name-00003-of-00003.gguf
- vLLM safetensors uses *.safetensors.index.json when present and sums the unique shard files listed in weight_map
- vLLM PyTorch binary weights use *.bin.index.json when present and sum the unique shard files listed in weight_map
- without an index file, vLLM accepts exactly one unsharded file or one complete shard set for the preferred format; safetensors is preferred over .bin
- adapter, LoRA, optimizer, scheduler, training, and RNG-state files are ignored

If a directory or repo contains multiple base model variants and --quant does
not select one of them, the CLI exits with an ambiguity diagnostic listing the
candidate artifact sets.
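
To illustrate what "one complete split-GGUF shard set" means in practice, the sketch below checks a file listing against the name-NNNNN-of-NNNNN.gguf pattern from the example above. The function names are hypothetical, and the five-digit zero-padding is taken from that example rather than from the tool's source.

```rust
use std::collections::BTreeMap;

/// Parse `<prefix>-<index>-of-<total>.gguf` into (prefix, index, total).
fn parse_shard_name(file_name: &str) -> Option<(String, u32, u32)> {
    let stem = file_name.strip_suffix(".gguf")?;
    let (rest, total) = stem.rsplit_once("-of-")?;
    let (prefix, index) = rest.rsplit_once('-')?;
    // Both numbers are expected to be five ASCII digits, as in the example.
    let five_digits = |s: &str| s.len() == 5 && s.bytes().all(|b| b.is_ascii_digit());
    if !five_digits(index) || !five_digits(total) {
        return None;
    }
    let index: u32 = index.parse().ok()?;
    let total: u32 = total.parse().ok()?;
    if index == 0 || index > total {
        return None;
    }
    Some((prefix.to_string(), index, total))
}

/// Return the shard names for `prefix` in shard order, but only when every
/// shard from 1 to `total` appears exactly once and all agree on `total`.
fn complete_shard_set(prefix: &str, files: &[&str]) -> Option<Vec<String>> {
    let mut found: BTreeMap<u32, String> = BTreeMap::new();
    let mut expected_total: Option<u32> = None;
    for name in files {
        if let Some((p, index, total)) = parse_shard_name(name) {
            if p != prefix {
                continue; // belongs to a different variant
            }
            if *expected_total.get_or_insert(total) != total {
                return None; // shards disagree about the set size
            }
            if found.insert(index, name.to_string()).is_some() {
                return None; // duplicate shard index
            }
        }
    }
    let total = expected_total?;
    if found.len() as u32 == total {
        Some(found.into_values().collect())
    } else {
        None
    }
}
```

Files that do not match the pattern, such as an adapter file sitting in the same directory, are simply skipped, which mirrors the rule that only one base artifact set is selected.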
The optimizer favors configurations that leave the requested safety margin. For
llamacpp, it parses the GGUF tensor table, separates non-layer tensors from
per-layer tensors by tensor name, and models partial layer offload plus KV cache
cost. For remote GGUF models, it uses bounded HTTP range reads against the
selected GGUF shard headers to parse tensor metadata without downloading model
weights. For vllm, it models fully resident weights, an activation overhead
allowance, and the KV budget available under gpu_memory_utilization.
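
The separation of per-layer and non-layer tensors by name can be sketched as follows. This is an illustration only: it assumes the blk.<N>. prefix used by llama.cpp-style GGUF tensor names, takes tensor sizes as already-parsed (name, bytes) pairs, and assumes that the highest-indexed layers are the ones offloaded and that non-layer tensors always count toward GPU memory. The names WeightLayout, classify_tensors, and gpu_weight_bytes are hypothetical.

```rust
use std::collections::BTreeMap;

/// Extract the layer index from names such as `blk.12.attn_q.weight`.
fn layer_index(tensor_name: &str) -> Option<usize> {
    let rest = tensor_name.strip_prefix("blk.")?;
    let (index, _) = rest.split_once('.')?;
    index.parse().ok()
}

/// Tensor bytes grouped into non-layer weights and per-layer weights.
struct WeightLayout {
    non_layer_bytes: u64,
    per_layer_bytes: BTreeMap<usize, u64>,
}

/// Group (tensor name, size in bytes) pairs by whether they belong to a layer.
fn classify_tensors(tensors: &[(&str, u64)]) -> WeightLayout {
    let mut layout = WeightLayout {
        non_layer_bytes: 0,
        per_layer_bytes: BTreeMap::new(),
    };
    for &(name, bytes) in tensors {
        match layer_index(name) {
            Some(layer) => *layout.per_layer_bytes.entry(layer).or_insert(0) += bytes,
            None => layout.non_layer_bytes += bytes,
        }
    }
    layout
}

/// Weight bytes on the GPU for a partial offload of `gpu_layers` layers.
/// Assumption: the highest-indexed layers are offloaded first.
fn gpu_weight_bytes(layout: &WeightLayout, gpu_layers: usize) -> u64 {
    let offloaded: u64 = layout
        .per_layer_bytes
        .values()
        .rev()
        .take(gpu_layers)
        .sum();
    layout.non_layer_bytes + offloaded
}
```

Keeping per-layer sizes in a map rather than assuming uniform layers matters for models whose layers differ in size.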
For llamacpp, GGUF tensor metadata can override the generic metadata KV formula.
When the selected GGUF exposes separate per-layer K/V projection tensors, the
optimizer derives the cache width from those tensor dimensions and sums only the
KV widths for GPU-resident layers. This matches llama.cpp's allocation model,
which allocates cache_k_lN and cache_v_lN tensors from each layer's
n_embd_k_gqa and n_embd_v_gqa.
When tensor-derived K/V widths are unavailable, the conservative fallback is:

    attn_layers * ctx_size * (key_length + value_length) * head_count_kv * bytes_per_element

No extra K/V multiplier is applied after (key_length + value_length).
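
Both estimates can be written as short functions. This is a sketch with hypothetical function names and integer byte widths, so it covers f16 (2 bytes) and f32 (4 bytes) but not quantized KV cache types, whose per-element size is fractional.

```rust
/// Fallback estimate from generic metadata, following the formula above.
fn fallback_kv_cache_bytes(
    attn_layers: u64,
    ctx_size: u64,
    key_length: u64,
    value_length: u64,
    head_count_kv: u64,
    bytes_per_element: u64,
) -> u64 {
    attn_layers * ctx_size * (key_length + value_length) * head_count_kv * bytes_per_element
}

/// Tensor-derived estimate: one (n_embd_k_gqa, n_embd_v_gqa) pair per
/// GPU-resident layer, so layers with different widths are handled exactly.
fn tensor_derived_kv_cache_bytes(
    gpu_layer_widths: &[(u64, u64)],
    ctx_size: u64,
    bytes_per_element: u64,
) -> u64 {
    let width_sum: u64 = gpu_layer_widths.iter().map(|&(k, v)| k + v).sum();
    width_sum * ctx_size * bytes_per_element
}
```

With illustrative values (32 attention layers, a 4096-token context, key and value lengths of 128, 8 KV heads, f16), the fallback gives 32 * 4096 * 256 * 8 * 2 = 536870912 bytes, or 512 MiB. The tensor-derived form returns the same number when every layer has widths of 1024 and 1024, and diverges only when layer widths differ or fewer layers are GPU-resident.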
The optimizer does not use a fixed grid of context buckets. For llama.cpp, VRAM cost is linear in context for a fixed gpu_layers value, so the tool computes the maximum feasible ctx_size for every layer-count candidate, filters dominated points, and scores the resulting Pareto frontier. For vLLM, it computes the maximum feasible max_model_len directly from the KV-cache budget.
This lets the CLI return boundary values such as 40123 when that is the largest value that fits, instead of rounding down to the nearest common bucket.
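
The frontier construction for llama.cpp can be sketched in a few lines. This is a simplified model, not the tool's implementation: CostModel and its fields are hypothetical, layers are assumed to be uniform in size, KV cost is charged only for GPU-resident layers, and the scoring step is left out.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Candidate {
    gpu_layers: u64,
    ctx_size: u64,
}

/// Simplified linear cost model (all sizes in bytes).
struct CostModel {
    budget_bytes: u64, // VRAM left after used memory and the safety margin
    non_layer_bytes: u64,
    bytes_per_layer: u64,
    kv_bytes_per_token_per_layer: u64,
    total_layers: u64,
    max_ctx: u64,
}

/// Largest ctx_size that fits for a fixed layer count, or None if even the
/// weights do not fit. Cost is linear in context, so one division suffices.
fn max_ctx_for_layers(m: &CostModel, gpu_layers: u64) -> Option<u64> {
    let layer_bytes = gpu_layers.checked_mul(m.bytes_per_layer)?;
    let fixed = m.non_layer_bytes.checked_add(layer_bytes)?;
    let remaining = m.budget_bytes.checked_sub(fixed)?;
    let per_token = gpu_layers * m.kv_bytes_per_token_per_layer;
    let ctx = if per_token == 0 {
        m.max_ctx // no GPU-side KV cost, so only the ceiling applies
    } else {
        (remaining / per_token).min(m.max_ctx)
    };
    if ctx == 0 {
        None
    } else {
        Some(ctx)
    }
}

/// Keep only points that are not dominated: walking from most layers to
/// fewest, a point survives only if it offers strictly more context.
fn pareto_frontier(m: &CostModel) -> Vec<Candidate> {
    let mut frontier = Vec::new();
    let mut best_ctx = 0;
    for gpu_layers in (0..=m.total_layers).rev() {
        if let Some(ctx_size) = max_ctx_for_layers(m, gpu_layers) {
            if ctx_size > best_ctx {
                frontier.push(Candidate { gpu_layers, ctx_size });
                best_ctx = ctx_size;
            }
        }
    }
    frontier
}
```

For a toy model with a budget of 1000 bytes, 100 non-layer bytes, 100 bytes per layer, 1 KV byte per token per layer, 4 layers, and a ceiling of 300 tokens, the frontier is (4, 125), (3, 200), (2, 300). The candidates with 1 and 0 layers also reach 300 tokens but are dominated because they give up layers without gaining context.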

    cargo test

The test suite includes the supplied RTX 5090 / Qwen3-Coder-Next validation case and fixed-parameter validation paths.
When model metadata reports a training context, vram-optimizer caps recommendations at the smaller of --max-ctx-size and the model's context metadata. This applies to both llamacpp ctx_size and vllm max_model_len. A --fixed-ctx-size above that ceiling fails with a diagnostic.
The --context-priority flag controls the tradeoff between GPU layers (speed) and context window (capacity):
| Mode | Behavior |
|---|---|
| balanced (default) | Balances GPU layers against context length. Context utility has diminishing returns above 65K. Capped at --max-ctx-size (default 131K). |
| prefer-context | Weights context more heavily but still applies diminishing returns. Ceiling lifts to the model's training context when --max-ctx-size is not set. |
| max-context | Linear context scoring: every token has equal value. Ceiling lifts to the model's training context. Aggressively trades GPU layers for context. |

Example on RTX 5090 with Qwen3-Coder-Next Q4_K_M (48 layers, 256K training context):

    balanced:       gpu_layers=30 ctx_size=73120   (sweet spot)
    prefer-context: gpu_layers=30 ctx_size=73120   (same: diminishing returns prevent a shift)
    max-context:    gpu_layers=27 ctx_size=262144  (full training context, trades 3 layers)

When --max-ctx-size is explicitly set, it is always respected as a hard cap regardless of priority mode.
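
One way to read the ceiling rules is as a small decision function. The sketch below is an interpretation, not the tool's code: the names are hypothetical, the "131K" default is assumed to mean 131072, and a model with no context metadata is assumed to fall back to that default.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum ContextPriority {
    Balanced,
    PreferContext,
    MaxContext,
}

/// Assumed numeric value of the documented "131K" default.
const DEFAULT_MAX_CTX: u64 = 131_072;

/// Effective context ceiling for a priority mode.
fn context_ceiling(
    priority: ContextPriority,
    explicit_max_ctx: Option<u64>,
    training_ctx: Option<u64>,
) -> u64 {
    let cap = match (explicit_max_ctx, priority) {
        // An explicit --max-ctx-size is a hard cap in every mode.
        (Some(max), _) => max,
        (None, ContextPriority::Balanced) => DEFAULT_MAX_CTX,
        // prefer-context and max-context lift the ceiling to the training context.
        (None, _) => training_ctx.unwrap_or(DEFAULT_MAX_CTX),
    };
    // Never exceed the model's own context metadata when it is known.
    match training_ctx {
        Some(training) => cap.min(training),
        None => cap,
    }
}

/// Reject a fixed context size above the ceiling.
fn validate_fixed_ctx(fixed_ctx: u64, ceiling: u64) -> Result<u64, String> {
    if fixed_ctx > ceiling {
        Err(format!(
            "fixed ctx size {} exceeds the ceiling of {}",
            fixed_ctx, ceiling
        ))
    } else {
        Ok(fixed_ctx)
    }
}
```

Under these assumptions, a model with a 262144-token training context gets a ceiling of 131072 in balanced mode and 262144 in max-context mode, which is consistent with the example above.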
The test suite covers optimizer arithmetic, GGUF parser behavior, malformed GGUF diagnostics, Hugging Face API mocking, CLI parsing, output stability, token cache behavior, file-selection edge cases, and binary integration behavior.
Run locally with:

    cargo fmt --check
    cargo clippy --all-targets --all-features -- -D warnings
    cargo test --all-targets --all-features

VRAM_OPTIMIZER_HF_ENDPOINT is used only by tests to redirect Hugging Face requests to a local mock server.