When specifying --mmproj /models/mmproj-Magistral-Small-2506-F16.gguf --cache-type-k f16, the first GPU tends to be overused because the multimodal projector model is loaded onto a single GPU only. This quickly escalates to an OOM on GPU 0. No adequate balancing is done across all GPUs; the others have plenty of unused VRAM.
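For reference, an invocation along these lines reproduces the imbalance on a multi-GPU host. This is a minimal sketch, not the exact command from this report: the main model path and the layer count are illustrative placeholders; only the --mmproj and --cache-type-k arguments are taken from the description above.

llama-server \
  -m /models/Magistral-Small-2506-F16.gguf \
  --mmproj /models/mmproj-Magistral-Small-2506-F16.gguf \
  --cache-type-k f16 \
  -ngl 99

With this setup the text model layers are distributed across devices, but the projector buffers land on device 0, which is where the allocation failure in the log below occurs.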
First Bad Commit
No response
Relevant log output
llama-cpp-openai_3 | ggml_backend_cuda_buffer_type_alloc_buffer: allocating 837.36 MiB on device 0: cudaMalloc failed: out of memory
llama-cpp-openai_3 | alloc_tensor_range: failed to allocate CUDA0 buffer of size 878039040
llama-cpp-openai_3 exited with code 139