Name and Version
The default option of -fit on (or manually setting fit to on) reports that all model and KV buffers are 0.00 MiB except the CPU output buffer from llama_context, and then crashes. -fit off works fine.
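For reference, a minimal sketch of the two invocations being compared; only the fit flag differs, and "..." stands for the identical model, RPC, device, and context flags given in full under the reproduce steps below:

./build-rpc-cuda/bin/llama-server ... -fit on     (reports 0.00 MiB buffers, then segfaults)
./build-rpc-cuda/bin/llama-server ... -fit off    (loads and serves normally)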
Operating systems
Linux
GGML backends
RPC
Hardware
I am using two RTX 3090s with CUDA on system 1 and one AMD 395+ with Vulkan on system 2.
Models
GLM 4.6V at Q6_K
Problem description & steps to reproduce
./build-rpc-cuda/bin/llama-server --rpc 192.168.1.139:50054 -m /dataset/model/zai-org_GLM-4.6V-Q6_K-00001-of-00003.gguf -ngl 999 -c 32768 --port 5001 --host 0.0.0.0 -ts 23,24,64 --api-key --no-mmap -dev CUDA0,CUDA1,RPC0 --chat-template-kwargs '{"enable_thinking": false}' --mmproj /dataset/model/mmproj-zai-org_GLM-4.6V-f16.gguf -fit on --verbose
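For completeness, the RPC endpoint at 192.168.1.139:50054 is served by llama.cpp's rpc-server tool on system 2 (the Vulkan machine); a sketch of how it was presumably started, where the build directory name is an assumption and only the port matches the command above:

./build-vulkan/bin/rpc-server --host 0.0.0.0 --port 50054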
First Bad Commit
No response
Relevant log output
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 7472 (4d1316c44) with GNU 13.3.0 for Linux x86_64
system info: n_threads = 6, n_threads_batch = 6, total_threads = 20
system_info: n_threads = 6 (n_threads_batch = 6) / 20 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
init: api_keys: ****a
init: using 19 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model '/dataset/model/zai-org_GLM-4.6V-Q6_K-00001-of-00003.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: getting device memory data for initial parameters:
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 22100 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:02:00.0) - 23766 MiB free
llama_model_load_from_file_impl: using device RPC0 (192.168.1.139:50054) (unknown id) - 91878 MiB free
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 56 key-value pairs and 780 tensors from /dataset/model/zai-org_GLM-4.6V-Q6_K-00001-of-00003.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = glm4moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_k i32 = 2
llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.600000
llama_model_loader: - kv 4: general.sampling.temp f32 = 0.800000
llama_model_loader: - kv 5: general.name str = GLM 4.6V
llama_model_loader: - kv 6: general.finetune str = 4.6V
llama_model_loader: - kv 7: general.basename str = GLM
llama_model_loader: - kv 8: general.size_label str = 128x8.0B
llama_model_loader: - kv 9: general.license str = mit
llama_model_loader: - kv 10: general.tags arr[str,1] = ["image-text-to-text"]
llama_model_loader: - kv 11: general.languages arr[str,2] = ["zh", "en"]
llama_model_loader: - kv 12: glm4moe.block_count u32 = 46
llama_model_loader: - kv 13: glm4moe.context_length u32 = 131072
llama_model_loader: - kv 14: glm4moe.embedding_length u32 = 4096
llama_model_loader: - kv 15: glm4moe.feed_forward_length u32 = 10944
llama_model_loader: - kv 16: glm4moe.attention.head_count u32 = 96
llama_model_loader: - kv 17: glm4moe.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: glm4moe.rope.dimension_sections arr[i32,4] = [8, 12, 12, 0]
llama_model_loader: - kv 19: glm4moe.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 20: glm4moe.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 21: glm4moe.expert_used_count u32 = 8
llama_model_loader: - kv 22: glm4moe.expert_group_count u32 = 1
llama_model_loader: - kv 23: glm4moe.expert_group_used_count u32 = 1
llama_model_loader: - kv 24: glm4moe.attention.key_length u32 = 128
llama_model_loader: - kv 25: glm4moe.attention.value_length u32 = 128
llama_model_loader: - kv 26: glm4moe.rope.dimension_count u32 = 64
llama_model_loader: - kv 27: glm4moe.expert_count u32 = 128
llama_model_loader: - kv 28: glm4moe.expert_feed_forward_length u32 = 1408
llama_model_loader: - kv 29: glm4moe.expert_shared_count u32 = 1
llama_model_loader: - kv 30: glm4moe.leading_dense_block_count u32 = 1
llama_model_loader: - kv 31: glm4moe.expert_gating_func u32 = 2
llama_model_loader: - kv 32: glm4moe.expert_weights_scale f32 = 1.000000
llama_model_loader: - kv 33: glm4moe.expert_weights_norm bool = true
llama_model_loader: - kv 34: glm4moe.nextn_predict_layers u32 = 0
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 36: tokenizer.ggml.pre str = glm4
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,151552] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,151552] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,318088] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 151329
llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 151329
llama_model_loader: - kv 42: tokenizer.ggml.bos_token_id u32 = 151331
llama_model_loader: - kv 43: tokenizer.ggml.eot_token_id u32 = 151336
llama_model_loader: - kv 44: tokenizer.ggml.unknown_token_id u32 = 151329
llama_model_loader: - kv 45: tokenizer.ggml.eom_token_id u32 = 151338
llama_model_loader: - kv 46: tokenizer.chat_template str = [gMASK]<sop>\n{%- if tools -%}\n<|syste...
llama_model_loader: - kv 47: general.quantization_version u32 = 2
llama_model_loader: - kv 48: general.file_type u32 = 18
llama_model_loader: - kv 49: quantize.imatrix.file str = /models_out/GLM-4.6V-GGUF/zai-org_GLM...
llama_model_loader: - kv 50: quantize.imatrix.dataset str = /training_dir/calibration_datav5.txt
llama_model_loader: - kv 51: quantize.imatrix.entries_count u32 = 502
llama_model_loader: - kv 52: quantize.imatrix.chunks_count u32 = 802
llama_model_loader: - kv 53: split.no u16 = 0
llama_model_loader: - kv 54: split.tensors.count i32 = 780
llama_model_loader: - kv 55: split.count u16 = 3
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q5_1: 23 tensors
llama_model_loader: - type q8_0: 158 tensors
llama_model_loader: - type q5_K: 46 tensors
llama_model_loader: - type q6_K: 232 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q6_K
print_info: file size = 89.30 GiB (7.18 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 151363 '<|image|>' is not marked as EOG
load: control token: 151362 '<|end_of_box|>' is not marked as EOG
load: control token: 151361 '<|begin_of_box|>' is not marked as EOG
load: control token: 151349 '<|code_suffix|>' is not marked as EOG
load: control token: 151348 '<|code_middle|>' is not marked as EOG
load: control token: 151346 '<|end_of_transcription|>' is not marked as EOG
load: control token: 151343 '<|begin_of_audio|>' is not marked as EOG
load: control token: 151342 '<|end_of_video|>' is not marked as EOG
load: control token: 151341 '<|begin_of_video|>' is not marked as EOG
load: control token: 151338 '<|observation|>' is not marked as EOG
load: control token: 151333 '<sop>' is not marked as EOG
load: control token: 151331 '[gMASK]' is not marked as EOG
load: control token: 151330 '[MASK]' is not marked as EOG
load: control token: 151347 '<|code_prefix|>' is not marked as EOG
load: control token: 151360 '/nothink' is not marked as EOG
load: control token: 151337 '<|assistant|>' is not marked as EOG
load: control token: 151332 '[sMASK]' is not marked as EOG
load: control token: 151334 '<eop>' is not marked as EOG
load: control token: 151335 '<|system|>' is not marked as EOG
load: control token: 151336 '<|user|>' is not marked as EOG
load: control token: 151340 '<|end_of_image|>' is not marked as EOG
load: control token: 151339 '<|begin_of_image|>' is not marked as EOG
load: control token: 151364 '<|video|>' is not marked as EOG
load: control token: 151345 '<|begin_of_transcription|>' is not marked as EOG
load: control token: 151344 '<|end_of_audio|>' is not marked as EOG
load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special_eom_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 151329 ('<|endoftext|>')
load: - 151336 ('<|user|>')
load: - 151338 ('<|observation|>')
load: special tokens cache size = 36
load: token to piece cache size = 0.9713 MB
print_info: arch = glm4moe
print_info: vocab_only = 0
print_info: no_alloc = 1
print_info: n_ctx_train = 131072
print_info: n_embd = 4096
print_info: n_embd_inp = 4096
print_info: n_layer = 46
print_info: n_head = 96
print_info: n_head_kv = 8
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 12
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 10944
print_info: n_expert = 128
print_info: n_expert_used = 8
print_info: n_expert_groups = 1
print_info: n_group_used = 1
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 8
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned = unknown
print_info: mrope sections = [8, 12, 12, 0]
print_info: model type = ?B
print_info: model params = 106.85 B
print_info: general.name = GLM 4.6V
print_info: vocab type = BPE
print_info: n_vocab = 151552
print_info: n_merges = 318088
print_info: BOS token = 151331 '[gMASK]'
print_info: EOS token = 151329 '<|endoftext|>'
print_info: EOT token = 151336 '<|user|>'
print_info: EOM token = 151338 '<|observation|>'
print_info: UNK token = 151329 '<|endoftext|>'
print_info: PAD token = 151329 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151347 '<|code_prefix|>'
print_info: FIM SUF token = 151349 '<|code_suffix|>'
print_info: FIM MID token = 151348 '<|code_middle|>'
print_info: EOG token = 151329 '<|endoftext|>'
print_info: EOG token = 151336 '<|user|>'
print_info: EOG token = 151338 '<|observation|>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: layer 0 assigned to device CUDA0, is_swa = 0
load_tensors: layer 1 assigned to device CUDA0, is_swa = 0
load_tensors: layer 2 assigned to device CUDA0, is_swa = 0
load_tensors: layer 3 assigned to device CUDA0, is_swa = 0
load_tensors: layer 4 assigned to device CUDA0, is_swa = 0
load_tensors: layer 5 assigned to device CUDA0, is_swa = 0
load_tensors: layer 6 assigned to device CUDA0, is_swa = 0
load_tensors: layer 7 assigned to device CUDA0, is_swa = 0
load_tensors: layer 8 assigned to device CUDA0, is_swa = 0
load_tensors: layer 9 assigned to device CUDA0, is_swa = 0
load_tensors: layer 10 assigned to device CUDA1, is_swa = 0
load_tensors: layer 11 assigned to device CUDA1, is_swa = 0
load_tensors: layer 12 assigned to device CUDA1, is_swa = 0
load_tensors: layer 13 assigned to device CUDA1, is_swa = 0
load_tensors: layer 14 assigned to device CUDA1, is_swa = 0
load_tensors: layer 15 assigned to device CUDA1, is_swa = 0
load_tensors: layer 16 assigned to device CUDA1, is_swa = 0
load_tensors: layer 17 assigned to device CUDA1, is_swa = 0
load_tensors: layer 18 assigned to device CUDA1, is_swa = 0
load_tensors: layer 19 assigned to device CUDA1, is_swa = 0
load_tensors: layer 20 assigned to device RPC0, is_swa = 0
load_tensors: layer 21 assigned to device RPC0, is_swa = 0
load_tensors: layer 22 assigned to device RPC0, is_swa = 0
load_tensors: layer 23 assigned to device RPC0, is_swa = 0
load_tensors: layer 24 assigned to device RPC0, is_swa = 0
load_tensors: layer 25 assigned to device RPC0, is_swa = 0
load_tensors: layer 26 assigned to device RPC0, is_swa = 0
load_tensors: layer 27 assigned to device RPC0, is_swa = 0
load_tensors: layer 28 assigned to device RPC0, is_swa = 0
load_tensors: layer 29 assigned to device RPC0, is_swa = 0
load_tensors: layer 30 assigned to device RPC0, is_swa = 0
load_tensors: layer 31 assigned to device RPC0, is_swa = 0
load_tensors: layer 32 assigned to device RPC0, is_swa = 0
load_tensors: layer 33 assigned to device RPC0, is_swa = 0
load_tensors: layer 34 assigned to device RPC0, is_swa = 0
load_tensors: layer 35 assigned to device RPC0, is_swa = 0
load_tensors: layer 36 assigned to device RPC0, is_swa = 0
load_tensors: layer 37 assigned to device RPC0, is_swa = 0
load_tensors: layer 38 assigned to device RPC0, is_swa = 0
load_tensors: layer 39 assigned to device RPC0, is_swa = 0
load_tensors: layer 40 assigned to device RPC0, is_swa = 0
load_tensors: layer 41 assigned to device RPC0, is_swa = 0
load_tensors: layer 42 assigned to device RPC0, is_swa = 0
load_tensors: layer 43 assigned to device RPC0, is_swa = 0
load_tensors: layer 44 assigned to device RPC0, is_swa = 0
load_tensors: layer 45 assigned to device RPC0, is_swa = 0
load_tensors: layer 46 assigned to device RPC0, is_swa = 0
create_tensor: loading tensor token_embd.weight
create_tensor: loading tensor output_norm.weight
create_tensor: loading tensor output.weight
create_tensor: loading tensor blk.0.attn_norm.weight
create_tensor: loading tensor blk.0.attn_q.weight
create_tensor: loading tensor blk.0.attn_k.weight
create_tensor: loading tensor blk.0.attn_v.weight
create_tensor: loading tensor blk.0.attn_q.bias
create_tensor: loading tensor blk.0.attn_k.bias
create_tensor: loading tensor blk.0.attn_v.bias
create_tensor: loading tensor blk.0.attn_output.weight
create_tensor: loading tensor blk.0.post_attention_norm.weight
create_tensor: loading tensor blk.0.ffn_gate.weight
create_tensor: loading tensor blk.0.ffn_down.weight
create_tensor: loading tensor blk.0.ffn_up.weight
create_tensor: loading tensor blk.1.attn_norm.weight
create_tensor: loading tensor blk.1.attn_q.weight
create_tensor: loading tensor blk.1.attn_k.weight
create_tensor: loading tensor blk.1.attn_v.weight
create_tensor: loading tensor blk.1.attn_q.bias
create_tensor: loading tensor blk.1.attn_k.bias
create_tensor: loading tensor blk.1.attn_v.bias
create_tensor: loading tensor blk.1.attn_output.weight
create_tensor: loading tensor blk.1.post_attention_norm.weight
create_tensor: loading tensor blk.1.ffn_gate_inp.weight
create_tensor: loading tensor blk.1.exp_probs_b.bias
create_tensor: loading tensor blk.1.ffn_gate_exps.weight
create_tensor: loading tensor blk.1.ffn_down_exps.weight
create_tensor: loading tensor blk.1.ffn_up_exps.weight
create_tensor: loading tensor blk.1.ffn_gate_shexp.weight
create_tensor: loading tensor blk.1.ffn_down_shexp.weight
create_tensor: loading tensor blk.1.ffn_up_shexp.weight
<snip duplicates>
create_tensor: loading tensor blk.45.attn_norm.weight
create_tensor: loading tensor blk.45.attn_q.weight
create_tensor: loading tensor blk.45.attn_k.weight
create_tensor: loading tensor blk.45.attn_v.weight
create_tensor: loading tensor blk.45.attn_q.bias
create_tensor: loading tensor blk.45.attn_k.bias
create_tensor: loading tensor blk.45.attn_v.bias
create_tensor: loading tensor blk.45.attn_output.weight
create_tensor: loading tensor blk.45.post_attention_norm.weight
create_tensor: loading tensor blk.45.ffn_gate_inp.weight
create_tensor: loading tensor blk.45.exp_probs_b.bias
create_tensor: loading tensor blk.45.ffn_gate_exps.weight
create_tensor: loading tensor blk.45.ffn_down_exps.weight
create_tensor: loading tensor blk.45.ffn_up_exps.weight
create_tensor: loading tensor blk.45.ffn_gate_shexp.weight
create_tensor: loading tensor blk.45.ffn_down_shexp.weight
create_tensor: loading tensor blk.45.ffn_up_shexp.weight
load_tensors: tensor 'token_embd.weight' (q6_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading output layer to GPU
load_tensors: offloading 45 repeating layers to GPU
load_tensors: offloaded 47/47 layers to GPU
load_tensors: CPU model buffer size = 0.00 MiB
load_tensors: CUDA0 model buffer size = 0.00 MiB
load_tensors: CUDA1 model buffer size = 0.00 MiB
load_tensors: RPC0[192.168.1.139:50054] model buffer size = 0.00 MiB
llama_context: constructing llama_context
llama_context: n_seq_max = 4
llama_context: n_ctx = 32768
llama_context: n_ctx_seq = 32768
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (32768) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context: CPU output buffer size = 2.31 MiB
llama_kv_cache: layer 0: dev = CUDA0
llama_kv_cache: layer 1: dev = CUDA0
llama_kv_cache: layer 2: dev = CUDA0
llama_kv_cache: layer 3: dev = CUDA0
llama_kv_cache: layer 4: dev = CUDA0
llama_kv_cache: layer 5: dev = CUDA0
llama_kv_cache: layer 6: dev = CUDA0
llama_kv_cache: layer 7: dev = CUDA0
llama_kv_cache: layer 8: dev = CUDA0
llama_kv_cache: layer 9: dev = CUDA0
llama_kv_cache: layer 10: dev = CUDA1
llama_kv_cache: layer 11: dev = CUDA1
llama_kv_cache: layer 12: dev = CUDA1
llama_kv_cache: layer 13: dev = CUDA1
llama_kv_cache: layer 14: dev = CUDA1
llama_kv_cache: layer 15: dev = CUDA1
llama_kv_cache: layer 16: dev = CUDA1
llama_kv_cache: layer 17: dev = CUDA1
llama_kv_cache: layer 18: dev = CUDA1
llama_kv_cache: layer 19: dev = CUDA1
llama_kv_cache: layer 20: dev = RPC0
llama_kv_cache: layer 21: dev = RPC0
llama_kv_cache: layer 22: dev = RPC0
llama_kv_cache: layer 23: dev = RPC0
llama_kv_cache: layer 24: dev = RPC0
llama_kv_cache: layer 25: dev = RPC0
llama_kv_cache: layer 26: dev = RPC0
llama_kv_cache: layer 27: dev = RPC0
llama_kv_cache: layer 28: dev = RPC0
llama_kv_cache: layer 29: dev = RPC0
llama_kv_cache: layer 30: dev = RPC0
llama_kv_cache: layer 31: dev = RPC0
llama_kv_cache: layer 32: dev = RPC0
llama_kv_cache: layer 33: dev = RPC0
llama_kv_cache: layer 34: dev = RPC0
llama_kv_cache: layer 35: dev = RPC0
llama_kv_cache: layer 36: dev = RPC0
llama_kv_cache: layer 37: dev = RPC0
llama_kv_cache: layer 38: dev = RPC0
llama_kv_cache: layer 39: dev = RPC0
llama_kv_cache: layer 40: dev = RPC0
llama_kv_cache: layer 41: dev = RPC0
llama_kv_cache: layer 42: dev = RPC0
llama_kv_cache: layer 43: dev = RPC0
llama_kv_cache: layer 44: dev = RPC0
llama_kv_cache: layer 45: dev = RPC0
llama_kv_cache: CUDA0 KV buffer size = 0.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 0.00 MiB
llama_kv_cache: RPC0[192.168.1.139:50054] KV buffer size = 0.00 MiB
llama_kv_cache: size = 5888.00 MiB ( 32768 cells, 46 layers, 4/1 seqs), K (f16): 2944.00 MiB, V (f16): 2944.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 4
llama_context: max_nodes = 6240
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 512, n_seqs = 4, n_outputs = 4
graph_reserve: reserving a graph for ubatch with n_tokens = 1, n_seqs = 4, n_outputs = 4
graph_reserve: making n_tokens a multiple of n_seqs - n_tokens = 4, n_seqs = 4, n_outputs = 4
llama_context: Flash Attention was auto, set to enabled
graph_reserve: reserving a graph for ubatch with n_tokens = 512, n_seqs = 4, n_outputs = 512
Segmentation fault (core dumped)
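As a sanity check on the log above: the llama_kv_cache size line is consistent with the printed parameters even though every per-device KV buffer reports 0.00 MiB. Assuming the usual f16 KV layout (cells x layers x n_embd_k_gqa x 2 bytes per element, and the same for V):

32768 cells x 46 layers x 1024 x 2 bytes = 3,087,007,744 bytes = 2944.00 MiB for K, and likewise 2944.00 MiB for V, i.e. the reported 5888.00 MiB total.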