
Eval bug: --mmproj loads model on first avail GPU only, model loading not balanced across all available tensor/GPU devices #15061

@git2212

Description

Name and Version

build: 5648

Operating systems

Linux

GGML backends

CUDA

Hardware

RTX 30xx and 50xx series

Models

-s -m /models/Qwen3-Coder-30B-A3B-Instruct-Q5_K_M.gguf \
  --mmproj /models/mmproj-Magistral-Small-2506-F16.gguf --cache-type-k f16 \
  --log-verbose --log-colors --metrics --host 0.0.0.0 --port 8080 \
  --threads 6 --numa distribute --n-gpu-layers 128 --no-mmap \
  --ctx-size 22000 --keep 4096 --temp 0.65 --batch-size 2048 \
  --ubatch-size 256 --n-predict 3072 --parallel 1 --mirostat 1 \
  --frequency-penalty 1.35 --repeat-last-n 256 --presence-penalty 1.85 \
  --min-p 0.10 --top-p 0.80 --top-k 40 --jinja
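
While this command loads, the imbalance can be confirmed by watching per-device VRAM; nvidia-smi's query interface (assuming NVIDIA drivers and nvidia-smi are available on the host) shows device 0 filling up well before the other GPUs:

    # Poll per-GPU memory usage once per second during model load
    watch -n 1 nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv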

Problem description & steps to reproduce

When specifying --mmproj /models/mmproj-Magistral-Small-2506-F16.gguf --cache-type-k f16, the first GPU tends to get overused, because the multimodal projector model is loaded onto a single GPU only. This quickly escalates to an out-of-memory error on GPU 0. No balancing is done across the available GPUs, even though the others have plenty of unused VRAM.
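
Until the projector is balanced automatically, a possible mitigation (a sketch only, assuming the mmproj weights are always placed on the first visible CUDA device; the llama-server binary name is also assumed, since the executable is not shown in the command above) is to skew --tensor-split so device 0 holds a smaller share of the LLM layers, leaving headroom for the projector, or to reorder devices with CUDA_VISIBLE_DEVICES so a GPU with more free VRAM becomes device 0. The ratios below are illustrative, not tuned values:

    # Illustrative 3-GPU example: halve device 0's share of the LLM weights
    # so the ~840 MiB projector buffer still fits on that device.
    llama-server -m /models/Qwen3-Coder-30B-A3B-Instruct-Q5_K_M.gguf \
      --mmproj /models/mmproj-Magistral-Small-2506-F16.gguf \
      --n-gpu-layers 128 --tensor-split 0.5,1,1

    # Or make a GPU with more free VRAM enumerate first:
    CUDA_VISIBLE_DEVICES=1,0,2 llama-server ...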

First Bad Commit

No response

Relevant log output

llama-cpp-openai_3  | ggml_backend_cuda_buffer_type_alloc_buffer: allocating 837.36 MiB on device 0: cudaMalloc failed: out of memory
llama-cpp-openai_3  | alloc_tensor_range: failed to allocate CUDA0 buffer of size 878039040
llama-cpp-openai_3 exited with code 139
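
For context on the log: the failed 878039040-byte buffer is the same allocation reported just above it (878039040 / 1024² ≈ 837.36 MiB), and exit code 139 corresponds to termination by SIGSEGV (128 + 11), i.e. the process crashed after the failed allocation rather than shutting down cleanly.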
