flox/vram-optimizer

vram-optimizer

vram-optimizer is a Rust CLI that recommends GPU memory parameters for single-GPU local LLM inference. It supports two engines:

  • llamacpp: recommends gpu_layers and ctx_size
  • vllm: recommends gpu_memory_utilization and max_model_len

The CLI is intended to run before an inference server starts. On Linux, a wrapper script should read total and used VRAM from nvidia-smi and pass them in with --vram-total-mib and --vram-used-mib. On macOS, a wrapper can bypass this binary.

Build

cargo build --release

Example

vram-optimizer \
  --engine llamacpp \
  --model ./Qwen3-Coder-Next-Q4_K_M.gguf \
  --vram-total-mib 32607 \
  --vram-used-mib 163 \
  --safety-margin-mib 500 \
  --kv-cache-type f16 \
  --context-priority balanced

Output defaults to machine-readable key=value pairs:

engine=llamacpp
gpu_layers=31
ctx_size=65536
vram_model_mib=30840
vram_kv_cache_mib=448
vram_total_estimated_mib=31288
vram_headroom_mib=656

Use --json for structured output.
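A wrapper that consumes the default key=value output might parse it as follows. This is a minimal sketch, not code from this repository: parse_kv and get_u64 are hypothetical helper names, and the sketch assumes one key=value pair per line.

```rust
use std::collections::HashMap;

/// Parse `key=value` lines into a map, skipping blank or malformed lines.
/// (Hypothetical helper; the one-pair-per-line layout is an assumption.)
fn parse_kv(output: &str) -> HashMap<String, String> {
    output
        .lines()
        .filter_map(|line| line.trim().split_once('='))
        .map(|(k, v)| (k.trim().to_string(), v.trim().to_string()))
        .collect()
}

/// Read a field as an unsigned integer, if it is present and numeric.
fn get_u64(map: &HashMap<String, String>, key: &str) -> Option<u64> {
    map.get(key)?.parse().ok()
}
```

In the sample output above, vram_model_mib plus vram_kv_cache_mib equals vram_total_estimated_mib (30840 + 448 = 31288), and 32607 - 163 - 500 - 31288 = 656 matches vram_headroom_mib. That is an observation about this one example, not a documented formula.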

Hugging Face models

For remote models, pass a Hugging Face Hub repo id as --model. The tool fetches config.json and the model listing. For private or gated repos, it looks for a token in this order:

  1. --hf-token <TOKEN>
  2. HF_TOKEN=<TOKEN>
  3. an encrypted cached token at $FLOX_ENV_CACHE/vram-optimizer/hf-token.enc
  4. interactive token entry when VRAM_OPTIMIZER_INTERACTIVE=1 is set

When interactive mode receives a nonblank token, the CLI encrypts it with ChaCha20-Poly1305 and writes the encrypted payload to $FLOX_ENV_CACHE/vram-optimizer/hf-token.enc. It also creates a cache-local 32-byte key at $FLOX_ENV_CACHE/vram-optimizer/hf-token.key. On Unix systems, the cache directory is set to mode 0700 and both files to mode 0600.

If $FLOX_ENV_CACHE is not set, interactive nonblank token entry fails with a clear diagnostic. Passing --hf-token or setting HF_TOKEN never writes token material to disk.
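The lookup order above can be sketched as a simple precedence chain. The types and names here (TokenSource, TokenInputs, resolve_token) are hypothetical and only illustrate the documented order; the real CLI's internals may differ.

```rust
/// Where a Hugging Face token came from, in the documented lookup order.
#[derive(Debug, PartialEq)]
enum TokenSource {
    CliFlag,
    EnvVar,
    EncryptedCache,
    Interactive,
}

impl TokenSource {
    /// Per the README, only interactive entry writes token material to disk.
    fn writes_cache(&self) -> bool {
        matches!(self, TokenSource::Interactive)
    }
}

/// Hypothetical inputs, already gathered from argument parsing, the
/// environment, the decrypted cache file, and a terminal prompt.
struct TokenInputs<'a> {
    cli_flag: Option<&'a str>,
    env_var: Option<&'a str>,
    cached: Option<&'a str>,
    interactive_enabled: bool,
    prompted: Option<&'a str>,
}

/// Return the first available token and its source.
fn resolve_token<'a>(inp: &TokenInputs<'a>) -> Option<(&'a str, TokenSource)> {
    if let Some(t) = inp.cli_flag {
        return Some((t, TokenSource::CliFlag));
    }
    if let Some(t) = inp.env_var {
        return Some((t, TokenSource::EnvVar));
    }
    if let Some(t) = inp.cached {
        return Some((t, TokenSource::EncryptedCache));
    }
    if inp.interactive_enabled {
        // Only a nonblank interactive entry counts as a token.
        if let Some(t) = inp.prompted.map(str::trim).filter(|t| !t.is_empty()) {
            return Some((t, TokenSource::Interactive));
        }
    }
    None
}
```

Keeping the source alongside the token makes the "never writes token material to disk" rule easy to enforce: a caller can check writes_cache() before touching $FLOX_ENV_CACHE.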

Model file selection

vram-optimizer never sums every file with a matching extension. It selects one base model artifact set before estimating VRAM:

  • local or remote GGUF inputs select one GGUF file, or one complete split-GGUF shard set such as name-00001-of-00003.gguf through name-00003-of-00003.gguf
  • vLLM safetensors uses *.safetensors.index.json when present and sums the unique shard files listed in weight_map
  • vLLM PyTorch binary weights use *.bin.index.json when present and sum the unique shard files listed in weight_map
  • without an index file, vLLM accepts exactly one unsharded file or one complete shard set for the preferred format; safetensors is preferred over .bin
  • adapter, LoRA, optimizer, scheduler, training, and RNG-state files are ignored

If a directory or repo contains multiple base model variants and --quant does not select one of them, the CLI exits with an ambiguity diagnostic listing the candidate artifact sets.
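To illustrate what "one complete split-GGUF shard set" means, here is a minimal sketch of a completeness check. It assumes shard names follow the five-digit name-NNNNN-of-MMMMM.gguf pattern shown above; parse_shard and is_complete_set are hypothetical names, not the tool's actual functions.

```rust
use std::collections::BTreeSet;

/// Parse a five-digit decimal field such as "00003".
fn parse_index(s: &str) -> Option<u32> {
    if s.len() == 5 && s.bytes().all(|b| b.is_ascii_digit()) {
        s.parse().ok()
    } else {
        None
    }
}

/// Split `<stem>-NNNNN-of-MMMMM.gguf` into (stem, index, total).
/// Returns None for unsharded files or malformed names.
fn parse_shard(name: &str) -> Option<(&str, u32, u32)> {
    let base = name.strip_suffix(".gguf")?;
    let (rest, total) = base.rsplit_once("-of-")?;
    let (stem, index) = rest.rsplit_once('-')?;
    let index = parse_index(index)?;
    let total = parse_index(total)?;
    (index >= 1 && index <= total).then_some((stem, index, total))
}

/// True when `names` contains every shard 1..=total for `stem`
/// and all shards agree on the total.
fn is_complete_set(names: &[&str], stem: &str) -> bool {
    let mut seen = BTreeSet::new();
    let mut expected_total = None;
    for name in names {
        if let Some((s, index, total)) = parse_shard(name) {
            if s != stem {
                continue; // belongs to a different variant
            }
            match expected_total {
                None => expected_total = Some(total),
                Some(t) if t != total => return false,
                _ => {}
            }
            seen.insert(index);
        }
    }
    // Indices are unique and within 1..=total, so a matching count
    // means every shard is present.
    expected_total == Some(seen.len() as u32)
}
```

A set with a missing middle shard is rejected rather than partially summed, which is the behavior the rules above describe.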

Notes on estimation

The optimizer favors configurations that leave the requested safety margin. For llamacpp, it parses the GGUF tensor table, separates non-layer tensors from per-layer tensors by tensor name, and models partial layer offload plus KV cache cost. For remote GGUF models, it uses bounded HTTP range reads against the selected GGUF shard headers to parse tensor metadata without downloading model weights. For vllm, it models fully resident weights, an activation overhead allowance, and the KV budget available under gpu_memory_utilization.
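For the vllm case, the relationship between these quantities can be sketched as below. This is an illustration under stated assumptions, not the tool's implementation: it assumes the usable budget is total VRAM times gpu_memory_utilization, that weights and the activation allowance are subtracted from it, and that KV cost is a fixed number of bytes per token. The function name and parameters are hypothetical.

```rust
const MIB: u64 = 1024 * 1024;

/// Largest max_model_len whose KV cache fits in the budget left after
/// weights and activation overhead. Returns None if the inputs are
/// invalid or the weights alone do not fit.
/// (Assumed model: usable = total * utilization.)
fn vllm_max_model_len(
    total_mib: u64,
    gpu_memory_utilization: f64,
    weights_mib: u64,
    activation_mib: u64,
    kv_bytes_per_token: u64,
) -> Option<u64> {
    if !(0.0..=1.0).contains(&gpu_memory_utilization) || kv_bytes_per_token == 0 {
        return None;
    }
    let usable = (total_mib as f64 * MIB as f64 * gpu_memory_utilization).floor() as u64;
    let fixed = (weights_mib + activation_mib) * MIB;
    let kv_budget = usable.checked_sub(fixed)?;
    Some(kv_budget / kv_bytes_per_token)
}
```

With hypothetical inputs of 24576 MiB total, 0.75 utilization, 14000 MiB of weights, 1024 MiB of activation allowance, and 128 KiB of KV per token, the KV budget is 3408 MiB and the result is 27264 tokens.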

For llamacpp, GGUF tensor metadata can override the generic metadata KV formula. When the selected GGUF exposes separate per-layer K/V projection tensors, the optimizer derives the cache width from those tensor dimensions and sums only the KV widths for GPU-resident layers. This matches llama.cpp's allocation model, which allocates cache_k_lN and cache_v_lN tensors from each layer's n_embd_k_gqa and n_embd_v_gqa.

When tensor-derived K/V widths are unavailable, the conservative fallback is:

attn_layers * ctx_size * (key_length + value_length) * head_count_kv * bytes_per_element

No extra K/V multiplier is applied after (key_length + value_length).
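The fallback formula translates directly into code. The sketch below uses whole bytes per element (2 for f16, 4 for f32), so quantized cache types with fractional element sizes would need a different representation; kv_cache_bytes is a hypothetical name.

```rust
/// Fallback KV cache size in bytes:
/// attn_layers * ctx_size * (key_length + value_length)
///   * head_count_kv * bytes_per_element
/// Returns None on arithmetic overflow.
fn kv_cache_bytes(
    attn_layers: u64,
    ctx_size: u64,
    key_length: u64,
    value_length: u64,
    head_count_kv: u64,
    bytes_per_element: u64,
) -> Option<u64> {
    attn_layers
        .checked_mul(ctx_size)?
        .checked_mul(key_length.checked_add(value_length)?)?
        .checked_mul(head_count_kv)?
        .checked_mul(bytes_per_element)
}
```

For a hypothetical model with 48 attention layers, key and value lengths of 128, 8 KV heads, an f16 cache, and ctx_size=65536, this gives 12884901888 bytes, or 12288 MiB.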

Exact context search

The optimizer does not use a fixed grid of context buckets. For llama.cpp, VRAM cost is linear in context for a fixed gpu_layers value, so the tool computes the maximum feasible ctx_size for every layer-count candidate, filters dominated points, and scores the resulting Pareto frontier. For vLLM, it computes the maximum feasible max_model_len directly from the KV-cache budget.

This lets the CLI return boundary values such as 40123 when that is the largest value that fits, instead of rounding down to the nearest common bucket.
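Because cost is linear in context for a fixed layer count, the largest feasible ctx_size is a single integer division, and dominated candidates can be dropped in one sorted pass. The following is a sketch of that idea with hypothetical names (Candidate, max_ctx, pareto_frontier); how the real tool derives the per-candidate fixed and per-token costs is not shown here.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Candidate {
    gpu_layers: u32,
    ctx_size: u64,
}

/// Largest ctx such that fixed_bytes + ctx * per_token_bytes <= budget_bytes.
/// Returns None when nothing fits.
fn max_ctx(budget_bytes: u64, fixed_bytes: u64, per_token_bytes: u64) -> Option<u64> {
    if per_token_bytes == 0 {
        return None;
    }
    let ctx = budget_bytes.checked_sub(fixed_bytes)? / per_token_bytes;
    (ctx > 0).then_some(ctx)
}

/// Keep only candidates that no other candidate matches or beats on both
/// gpu_layers and ctx_size.
fn pareto_frontier(mut points: Vec<Candidate>) -> Vec<Candidate> {
    // Most layers first; ties broken by larger context.
    points.sort_by(|a, b| {
        b.gpu_layers
            .cmp(&a.gpu_layers)
            .then(b.ctx_size.cmp(&a.ctx_size))
    });
    let mut frontier = Vec::new();
    let mut best_ctx = 0u64;
    for p in points {
        // A point with fewer layers survives only if it buys more context.
        if p.ctx_size > best_ctx {
            best_ctx = p.ctx_size;
            frontier.push(p);
        }
    }
    frontier
}
```

Integer division is what produces boundary values: with a hypothetical budget of 40124499 bytes, 500 fixed bytes, and 1000 bytes per token, max_ctx returns exactly 40123.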

Test

cargo test

The test suite includes the supplied RTX 5090 / Qwen3-Coder-Next validation case and fixed-parameter validation paths.

Model training context ceiling

When model metadata reports a training context, vram-optimizer caps recommendations at the smaller of --max-ctx-size and the model's context metadata. This applies to both llamacpp ctx_size and vllm max_model_len. A --fixed-ctx-size above that ceiling fails with a diagnostic.
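The ceiling rule amounts to taking a minimum and rejecting fixed values above it. A minimal sketch, with hypothetical function names and an illustrative error message rather than the CLI's actual diagnostic text:

```rust
/// Effective ceiling: the smaller of --max-ctx-size and the model's
/// training context, when the metadata reports one.
fn effective_ceiling(max_ctx_size: u64, training_ctx: Option<u64>) -> u64 {
    training_ctx.map_or(max_ctx_size, |t| t.min(max_ctx_size))
}

/// Validate a --fixed-ctx-size request against the ceiling.
/// (The message wording here is illustrative only.)
fn check_fixed_ctx(fixed: u64, ceiling: u64) -> Result<u64, String> {
    if fixed > ceiling {
        Err(format!(
            "fixed context size {fixed} exceeds the ceiling of {ceiling}"
        ))
    } else {
        Ok(fixed)
    }
}
```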

Context priority

The --context-priority flag controls the tradeoff between GPU layers (speed) and context window (capacity):

| Mode | Behavior |
| --- | --- |
| balanced (default) | Finds the sweet spot. Context utility has diminishing returns above 65K. Capped at --max-ctx-size (default 131K). |
| prefer-context | Heavier weight on context, still uses diminishing returns. Ceiling lifts to model's training context when --max-ctx-size is not set. |
| max-context | Linear context scoring: every token has equal value. Ceiling lifts to model's training context. Will aggressively trade GPU layers for context. |

Example on RTX 5090 with Qwen3-Coder-Next Q4_K_M (48 layers, 256K training context):

balanced:        gpu_layers=30  ctx_size=73120    (sweet spot)
prefer-context:  gpu_layers=30  ctx_size=73120    (same — diminishing returns prevent shift)
max-context:     gpu_layers=27  ctx_size=262144   (full training context, trades 3 layers)

When --max-ctx-size is explicitly set, it is always respected as a hard cap regardless of priority mode.
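The difference between diminishing-returns and linear scoring can be illustrated with a toy utility function. The README does not give the actual scoring formula, so everything here is an assumption: the logarithmic shape past the knee, the Scoring enum, and the context_utility name are all hypothetical, and the knee is a parameter rather than a claim about the exact "65K" threshold.

```rust
#[derive(Clone, Copy)]
enum Scoring {
    /// Utility grows more slowly past the knee (balanced-style).
    Diminishing,
    /// Every token counts equally (max-context-style).
    Linear,
}

/// Toy context utility. `knee` must be nonzero.
/// Below the knee both variants are linear; above it, the diminishing
/// variant grows logarithmically (an assumed shape, for illustration).
fn context_utility(ctx_size: u64, knee: u64, scoring: Scoring) -> f64 {
    let ctx = ctx_size as f64;
    let knee = knee as f64;
    match scoring {
        Scoring::Linear => ctx,
        Scoring::Diminishing if ctx <= knee => ctx,
        Scoring::Diminishing => knee * (1.0 + (ctx / knee).ln()),
    }
}
```

Under a curve like this, quadrupling the context past the knee adds far less utility than it would under linear scoring, which is consistent with balanced keeping more GPU layers while max-context trades layers for the full training context.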

Testing

The test suite covers optimizer arithmetic, GGUF parser behavior, malformed GGUF diagnostics, Hugging Face API mocking, CLI parsing, output stability, token cache behavior, file-selection edge cases, and binary integration behavior.

Run locally with:

cargo fmt --check
cargo clippy --all-targets --all-features -- -D warnings
cargo test --all-targets --all-features

VRAM_OPTIMIZER_HF_ENDPOINT is used only by tests to redirect Hugging Face requests to a local mock server.
