
Eval bug: --mmproj loads model on first avail GPU only, model loading not balanced across all available tensor/GPU devices #15061

@git2212

Description

Name and Version

build: 5648

Operating systems

Linux

GGML backends

CUDA

Hardware

RTX 30xx and 50xx series

Models

-s -m /models/Qwen3-Coder-30B-A3B-Instruct-Q5_K_M.gguf \
  --mmproj /models/mmproj-Magistral-Small-2506-F16.gguf --cache-type-k f16 \
  --log-verbose --log-colors --metrics --host 0.0.0.0 --port 8080 \
  --threads 6 --numa distribute --n-gpu-layers 128 --no-mmap \
  --ctx-size 22000 --keep 4096 --temp 0.65 --batch-size 2048 \
  --ubatch-size 256 --n-predict 3072 --parallel 1 --mirostat 1 \
  --frequency-penalty 1.35 --repeat-last-n 256 --presence-penalty 1.85 \
  --min-p 0.10 --top-p 0.80 --top-k 40 --jinja
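
While this command loads, the imbalance can be confirmed by watching per-device VRAM; nvidia-smi's query interface (assuming NVIDIA drivers and nvidia-smi are available on the host) shows device 0 filling up well before the other GPUs:

    # Poll per-GPU memory usage once per second during model load
    watch -n 1 nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv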

Problem description & steps to reproduce

When specifying --mmproj /models/mmproj-Magistral-Small-2506-F16.gguf --cache-type-k f16, the first GPU tends to get overused, because the multimodal projector model is loaded onto a single GPU only. This quickly escalates to an out-of-memory error on GPU 0. No balancing is done across the available GPUs, even though the others have plenty of unused VRAM.
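
Until the projector is balanced automatically, a possible mitigation (a sketch only, assuming the mmproj weights are always placed on the first visible CUDA device; the llama-server binary name is also assumed, since the executable is not shown in the command above) is to skew --tensor-split so device 0 holds a smaller share of the LLM layers, leaving headroom for the projector, or to reorder devices with CUDA_VISIBLE_DEVICES so a GPU with more free VRAM becomes device 0. The ratios below are illustrative, not tuned values:

    # Illustrative 3-GPU example: halve device 0's share of the LLM weights
    # so the ~840 MiB projector buffer still fits on that device.
    llama-server -m /models/Qwen3-Coder-30B-A3B-Instruct-Q5_K_M.gguf \
      --mmproj /models/mmproj-Magistral-Small-2506-F16.gguf \
      --n-gpu-layers 128 --tensor-split 0.5,1,1

    # Or make a GPU with more free VRAM enumerate first:
    CUDA_VISIBLE_DEVICES=1,0,2 llama-server ...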

First Bad Commit

No response

Relevant log output

llama-cpp-openai_3  | ggml_backend_cuda_buffer_type_alloc_buffer: allocating 837.36 MiB on device 0: cudaMalloc failed: out of memory
llama-cpp-openai_3  | alloc_tensor_range: failed to allocate CUDA0 buffer of size 878039040
llama-cpp-openai_3 exited with code 139
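
For context on the log: the failed 878039040-byte buffer is the same allocation reported just above it (878039040 / 1024² ≈ 837.36 MiB), and exit code 139 corresponds to termination by SIGSEGV (128 + 11), i.e. the process crashed after the failed allocation rather than shutting down cleanly.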
