Eval bug: FIT not calculating properly #18175

@yggdrasil75

Description

Name and Version

The default of fit=on (or manually setting fit to on) reports that all buffers are 0.0 MiB except the CPU context buffer, and the server then crashes with a segmentation fault. fit=off works fine.
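
For reference, the only difference between the two cases is the -fit argument on the same invocation; a minimal sketch with the model path abbreviated (the full command is under steps to reproduce below):

./build-rpc-cuda/bin/llama-server -m <model>.gguf -fit on    # buffers report 0.0 MiB, then segfault
./build-rpc-cuda/bin/llama-server -m <model>.gguf -fit off   # loads and runs normally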

Operating systems

Linux

GGML backends

RPC

Hardware

I am using two RTX 3090s with CUDA on system 1, and one AMD Ryzen AI Max+ 395 with Vulkan on system 2, reached over RPC.

Models

GLM-4.6V at Q6_K

Problem description & steps to reproduce

./build-rpc-cuda/bin/llama-server --rpc 192.168.1.139:50054 -m /dataset/model/zai-org_GLM-4.6V-Q6_K-00001-of-00003.gguf -ngl 999 -c 32768 --port 5001 --host 0.0.0.0 -ts 23,24,64 --api-key --no-mmap -dev CUDA0,CUDA1,RPC0 --chat-template-kwargs '{"enable_thinking": false}' --mmproj /dataset/model/mmproj-zai-org_GLM-4.6V-f16.gguf -fit on --verbose
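
For comparison, the identical command with -fit off (the workaround the loader log itself suggests) loads and serves without crashing:

./build-rpc-cuda/bin/llama-server --rpc 192.168.1.139:50054 -m /dataset/model/zai-org_GLM-4.6V-Q6_K-00001-of-00003.gguf -ngl 999 -c 32768 --port 5001 --host 0.0.0.0 -ts 23,24,64 --api-key --no-mmap -dev CUDA0,CUDA1,RPC0 --chat-template-kwargs '{"enable_thinking": false}' --mmproj /dataset/model/mmproj-zai-org_GLM-4.6V-f16.gguf -fit off --verbose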

First Bad Commit

No response

Relevant log output

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 7472 (4d1316c44) with GNU 13.3.0 for Linux x86_64
system info: n_threads = 6, n_threads_batch = 6, total_threads = 20
system_info: n_threads = 6 (n_threads_batch = 6) / 20 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
init: api_keys: ****a
init: using 19 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model '/dataset/model/zai-org_GLM-4.6V-Q6_K-00001-of-00003.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: getting device memory data for initial parameters:
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 22100 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:02:00.0) - 23766 MiB free
llama_model_load_from_file_impl: using device RPC0 (192.168.1.139:50054) (unknown id) - 91878 MiB free
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 56 key-value pairs and 780 tensors from /dataset/model/zai-org_GLM-4.6V-Q6_K-00001-of-00003.gguf (version GGUF V3 (latest))                                                                 llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = glm4moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 2
llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.600000
llama_model_loader: - kv   4:                      general.sampling.temp f32              = 0.800000
llama_model_loader: - kv   5:                               general.name str              = GLM 4.6V
llama_model_loader: - kv   6:                           general.finetune str              = 4.6V
llama_model_loader: - kv   7:                           general.basename str              = GLM
llama_model_loader: - kv   8:                         general.size_label str              = 128x8.0B
llama_model_loader: - kv   9:                            general.license str              = mit
llama_model_loader: - kv  10:                               general.tags arr[str,1]       = ["image-text-to-text"]
llama_model_loader: - kv  11:                          general.languages arr[str,2]       = ["zh", "en"]
llama_model_loader: - kv  12:                        glm4moe.block_count u32              = 46
llama_model_loader: - kv  13:                     glm4moe.context_length u32              = 131072
llama_model_loader: - kv  14:                   glm4moe.embedding_length u32              = 4096
llama_model_loader: - kv  15:                glm4moe.feed_forward_length u32              = 10944
llama_model_loader: - kv  16:               glm4moe.attention.head_count u32              = 96
llama_model_loader: - kv  17:            glm4moe.attention.head_count_kv u32              = 8
llama_model_loader: - kv  18:            glm4moe.rope.dimension_sections arr[i32,4]       = [8, 12, 12, 0]
llama_model_loader: - kv  19:                     glm4moe.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  20:   glm4moe.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  21:                  glm4moe.expert_used_count u32              = 8
llama_model_loader: - kv  22:                 glm4moe.expert_group_count u32              = 1
llama_model_loader: - kv  23:            glm4moe.expert_group_used_count u32              = 1
llama_model_loader: - kv  24:               glm4moe.attention.key_length u32              = 128
llama_model_loader: - kv  25:             glm4moe.attention.value_length u32              = 128
llama_model_loader: - kv  26:               glm4moe.rope.dimension_count u32              = 64
llama_model_loader: - kv  27:                       glm4moe.expert_count u32              = 128
llama_model_loader: - kv  28:         glm4moe.expert_feed_forward_length u32              = 1408
llama_model_loader: - kv  29:                glm4moe.expert_shared_count u32              = 1
llama_model_loader: - kv  30:          glm4moe.leading_dense_block_count u32              = 1
llama_model_loader: - kv  31:                 glm4moe.expert_gating_func u32              = 2
llama_model_loader: - kv  32:               glm4moe.expert_weights_scale f32              = 1.000000
llama_model_loader: - kv  33:                glm4moe.expert_weights_norm bool             = true
llama_model_loader: - kv  34:               glm4moe.nextn_predict_layers u32              = 0
llama_model_loader: - kv  35:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  36:                         tokenizer.ggml.pre str              = glm4
llama_model_loader: - kv  37:                      tokenizer.ggml.tokens arr[str,151552]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  38:                  tokenizer.ggml.token_type arr[i32,151552]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  39:                      tokenizer.ggml.merges arr[str,318088]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  40:                tokenizer.ggml.eos_token_id u32              = 151329
llama_model_loader: - kv  41:            tokenizer.ggml.padding_token_id u32              = 151329
llama_model_loader: - kv  42:                tokenizer.ggml.bos_token_id u32              = 151331
llama_model_loader: - kv  43:                tokenizer.ggml.eot_token_id u32              = 151336
llama_model_loader: - kv  44:            tokenizer.ggml.unknown_token_id u32              = 151329
llama_model_loader: - kv  45:                tokenizer.ggml.eom_token_id u32              = 151338
llama_model_loader: - kv  46:                    tokenizer.chat_template str              = [gMASK]<sop>\n{%- if tools -%}\n<|syste...
llama_model_loader: - kv  47:               general.quantization_version u32              = 2
llama_model_loader: - kv  48:                          general.file_type u32              = 18
llama_model_loader: - kv  49:                      quantize.imatrix.file str              = /models_out/GLM-4.6V-GGUF/zai-org_GLM...
llama_model_loader: - kv  50:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav5.txt
llama_model_loader: - kv  51:             quantize.imatrix.entries_count u32              = 502
llama_model_loader: - kv  52:              quantize.imatrix.chunks_count u32              = 802
llama_model_loader: - kv  53:                                   split.no u16              = 0
llama_model_loader: - kv  54:                        split.tensors.count i32              = 780
llama_model_loader: - kv  55:                                split.count u16              = 3
llama_model_loader: - type  f32:  321 tensors
llama_model_loader: - type q5_1:   23 tensors
llama_model_loader: - type q8_0:  158 tensors
llama_model_loader: - type q5_K:   46 tensors
llama_model_loader: - type q6_K:  232 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q6_K
print_info: file size   = 89.30 GiB (7.18 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 151363 '<|image|>' is not marked as EOG
load: control token: 151362 '<|end_of_box|>' is not marked as EOG
load: control token: 151361 '<|begin_of_box|>' is not marked as EOG
load: control token: 151349 '<|code_suffix|>' is not marked as EOG
load: control token: 151348 '<|code_middle|>' is not marked as EOG
load: control token: 151346 '<|end_of_transcription|>' is not marked as EOG
load: control token: 151343 '<|begin_of_audio|>' is not marked as EOG
load: control token: 151342 '<|end_of_video|>' is not marked as EOG
load: control token: 151341 '<|begin_of_video|>' is not marked as EOG
load: control token: 151338 '<|observation|>' is not marked as EOG
load: control token: 151333 '<sop>' is not marked as EOG
load: control token: 151331 '[gMASK]' is not marked as EOG
load: control token: 151330 '[MASK]' is not marked as EOG
load: control token: 151347 '<|code_prefix|>' is not marked as EOG
load: control token: 151360 '/nothink' is not marked as EOG
load: control token: 151337 '<|assistant|>' is not marked as EOG
load: control token: 151332 '[sMASK]' is not marked as EOG
load: control token: 151334 '<eop>' is not marked as EOG
load: control token: 151335 '<|system|>' is not marked as EOG
load: control token: 151336 '<|user|>' is not marked as EOG
load: control token: 151340 '<|end_of_image|>' is not marked as EOG
load: control token: 151339 '<|begin_of_image|>' is not marked as EOG
load: control token: 151364 '<|video|>' is not marked as EOG
load: control token: 151345 '<|begin_of_transcription|>' is not marked as EOG
load: control token: 151344 '<|end_of_audio|>' is not marked as EOG
load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special_eom_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 151329 ('<|endoftext|>')
load:   - 151336 ('<|user|>')
load:   - 151338 ('<|observation|>')
load: special tokens cache size = 36
load: token to piece cache size = 0.9713 MB
print_info: arch             = glm4moe
print_info: vocab_only       = 0
print_info: no_alloc         = 1
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_embd_inp       = 4096
print_info: n_layer          = 46
print_info: n_head           = 96
print_info: n_head_kv        = 8
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 12
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 10944
print_info: n_expert         = 128
print_info: n_expert_used    = 8
print_info: n_expert_groups  = 1
print_info: n_group_used     = 1
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 8
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned   = unknown
print_info: mrope sections   = [8, 12, 12, 0]
print_info: model type       = ?B
print_info: model params     = 106.85 B
print_info: general.name     = GLM 4.6V
print_info: vocab type       = BPE
print_info: n_vocab          = 151552
print_info: n_merges         = 318088
print_info: BOS token        = 151331 '[gMASK]'
print_info: EOS token        = 151329 '<|endoftext|>'
print_info: EOT token        = 151336 '<|user|>'
print_info: EOM token        = 151338 '<|observation|>'
print_info: UNK token        = 151329 '<|endoftext|>'
print_info: PAD token        = 151329 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151347 '<|code_prefix|>'
print_info: FIM SUF token    = 151349 '<|code_suffix|>'
print_info: FIM MID token    = 151348 '<|code_middle|>'
print_info: EOG token        = 151329 '<|endoftext|>'
print_info: EOG token        = 151336 '<|user|>'
print_info: EOG token        = 151338 '<|observation|>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: layer   0 assigned to device CUDA0, is_swa = 0
load_tensors: layer   1 assigned to device CUDA0, is_swa = 0
load_tensors: layer   2 assigned to device CUDA0, is_swa = 0
load_tensors: layer   3 assigned to device CUDA0, is_swa = 0
load_tensors: layer   4 assigned to device CUDA0, is_swa = 0
load_tensors: layer   5 assigned to device CUDA0, is_swa = 0
load_tensors: layer   6 assigned to device CUDA0, is_swa = 0
load_tensors: layer   7 assigned to device CUDA0, is_swa = 0
load_tensors: layer   8 assigned to device CUDA0, is_swa = 0
load_tensors: layer   9 assigned to device CUDA0, is_swa = 0
load_tensors: layer  10 assigned to device CUDA1, is_swa = 0
load_tensors: layer  11 assigned to device CUDA1, is_swa = 0
load_tensors: layer  12 assigned to device CUDA1, is_swa = 0
load_tensors: layer  13 assigned to device CUDA1, is_swa = 0
load_tensors: layer  14 assigned to device CUDA1, is_swa = 0
load_tensors: layer  15 assigned to device CUDA1, is_swa = 0
load_tensors: layer  16 assigned to device CUDA1, is_swa = 0
load_tensors: layer  17 assigned to device CUDA1, is_swa = 0
load_tensors: layer  18 assigned to device CUDA1, is_swa = 0
load_tensors: layer  19 assigned to device CUDA1, is_swa = 0
load_tensors: layer  20 assigned to device RPC0, is_swa = 0
load_tensors: layer  21 assigned to device RPC0, is_swa = 0
load_tensors: layer  22 assigned to device RPC0, is_swa = 0
load_tensors: layer  23 assigned to device RPC0, is_swa = 0
load_tensors: layer  24 assigned to device RPC0, is_swa = 0
load_tensors: layer  25 assigned to device RPC0, is_swa = 0
load_tensors: layer  26 assigned to device RPC0, is_swa = 0
load_tensors: layer  27 assigned to device RPC0, is_swa = 0
load_tensors: layer  28 assigned to device RPC0, is_swa = 0
load_tensors: layer  29 assigned to device RPC0, is_swa = 0
load_tensors: layer  30 assigned to device RPC0, is_swa = 0
load_tensors: layer  31 assigned to device RPC0, is_swa = 0
load_tensors: layer  32 assigned to device RPC0, is_swa = 0
load_tensors: layer  33 assigned to device RPC0, is_swa = 0
load_tensors: layer  34 assigned to device RPC0, is_swa = 0
load_tensors: layer  35 assigned to device RPC0, is_swa = 0
load_tensors: layer  36 assigned to device RPC0, is_swa = 0
load_tensors: layer  37 assigned to device RPC0, is_swa = 0
load_tensors: layer  38 assigned to device RPC0, is_swa = 0
load_tensors: layer  39 assigned to device RPC0, is_swa = 0
load_tensors: layer  40 assigned to device RPC0, is_swa = 0
load_tensors: layer  41 assigned to device RPC0, is_swa = 0
load_tensors: layer  42 assigned to device RPC0, is_swa = 0
load_tensors: layer  43 assigned to device RPC0, is_swa = 0
load_tensors: layer  44 assigned to device RPC0, is_swa = 0
load_tensors: layer  45 assigned to device RPC0, is_swa = 0
load_tensors: layer  46 assigned to device RPC0, is_swa = 0
create_tensor: loading tensor token_embd.weight
create_tensor: loading tensor output_norm.weight
create_tensor: loading tensor output.weight
create_tensor: loading tensor blk.0.attn_norm.weight
create_tensor: loading tensor blk.0.attn_q.weight
create_tensor: loading tensor blk.0.attn_k.weight
create_tensor: loading tensor blk.0.attn_v.weight
create_tensor: loading tensor blk.0.attn_q.bias
create_tensor: loading tensor blk.0.attn_k.bias
create_tensor: loading tensor blk.0.attn_v.bias
create_tensor: loading tensor blk.0.attn_output.weight
create_tensor: loading tensor blk.0.post_attention_norm.weight
create_tensor: loading tensor blk.0.ffn_gate.weight
create_tensor: loading tensor blk.0.ffn_down.weight
create_tensor: loading tensor blk.0.ffn_up.weight
create_tensor: loading tensor blk.1.attn_norm.weight
create_tensor: loading tensor blk.1.attn_q.weight
create_tensor: loading tensor blk.1.attn_k.weight
create_tensor: loading tensor blk.1.attn_v.weight
create_tensor: loading tensor blk.1.attn_q.bias
create_tensor: loading tensor blk.1.attn_k.bias
create_tensor: loading tensor blk.1.attn_v.bias
create_tensor: loading tensor blk.1.attn_output.weight
create_tensor: loading tensor blk.1.post_attention_norm.weight
create_tensor: loading tensor blk.1.ffn_gate_inp.weight
create_tensor: loading tensor blk.1.exp_probs_b.bias
create_tensor: loading tensor blk.1.ffn_gate_exps.weight
create_tensor: loading tensor blk.1.ffn_down_exps.weight
create_tensor: loading tensor blk.1.ffn_up_exps.weight
create_tensor: loading tensor blk.1.ffn_gate_shexp.weight
create_tensor: loading tensor blk.1.ffn_down_shexp.weight
create_tensor: loading tensor blk.1.ffn_up_shexp.weight
<snip duplicates>
create_tensor: loading tensor blk.45.attn_norm.weight
create_tensor: loading tensor blk.45.attn_q.weight
create_tensor: loading tensor blk.45.attn_k.weight
create_tensor: loading tensor blk.45.attn_v.weight
create_tensor: loading tensor blk.45.attn_q.bias
create_tensor: loading tensor blk.45.attn_k.bias
create_tensor: loading tensor blk.45.attn_v.bias
create_tensor: loading tensor blk.45.attn_output.weight
create_tensor: loading tensor blk.45.post_attention_norm.weight
create_tensor: loading tensor blk.45.ffn_gate_inp.weight
create_tensor: loading tensor blk.45.exp_probs_b.bias
create_tensor: loading tensor blk.45.ffn_gate_exps.weight
create_tensor: loading tensor blk.45.ffn_down_exps.weight
create_tensor: loading tensor blk.45.ffn_up_exps.weight
create_tensor: loading tensor blk.45.ffn_gate_shexp.weight
create_tensor: loading tensor blk.45.ffn_down_shexp.weight
create_tensor: loading tensor blk.45.ffn_up_shexp.weight
load_tensors: tensor 'token_embd.weight' (q6_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading output layer to GPU
load_tensors: offloading 45 repeating layers to GPU
load_tensors: offloaded 47/47 layers to GPU
load_tensors:          CPU model buffer size =     0.00 MiB
load_tensors:        CUDA0 model buffer size =     0.00 MiB
load_tensors:        CUDA1 model buffer size =     0.00 MiB
load_tensors: RPC0[192.168.1.139:50054] model buffer size =     0.00 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 4
llama_context: n_ctx         = 32768
llama_context: n_ctx_seq     = 32768
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = true
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (32768) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:        CPU  output buffer size =     2.31 MiB
llama_kv_cache: layer   0: dev = CUDA0
llama_kv_cache: layer   1: dev = CUDA0
llama_kv_cache: layer   2: dev = CUDA0
llama_kv_cache: layer   3: dev = CUDA0
llama_kv_cache: layer   4: dev = CUDA0
llama_kv_cache: layer   5: dev = CUDA0
llama_kv_cache: layer   6: dev = CUDA0
llama_kv_cache: layer   7: dev = CUDA0
llama_kv_cache: layer   8: dev = CUDA0
llama_kv_cache: layer   9: dev = CUDA0
llama_kv_cache: layer  10: dev = CUDA1
llama_kv_cache: layer  11: dev = CUDA1
llama_kv_cache: layer  12: dev = CUDA1
llama_kv_cache: layer  13: dev = CUDA1
llama_kv_cache: layer  14: dev = CUDA1
llama_kv_cache: layer  15: dev = CUDA1
llama_kv_cache: layer  16: dev = CUDA1
llama_kv_cache: layer  17: dev = CUDA1
llama_kv_cache: layer  18: dev = CUDA1
llama_kv_cache: layer  19: dev = CUDA1
llama_kv_cache: layer  20: dev = RPC0
llama_kv_cache: layer  21: dev = RPC0
llama_kv_cache: layer  22: dev = RPC0
llama_kv_cache: layer  23: dev = RPC0
llama_kv_cache: layer  24: dev = RPC0
llama_kv_cache: layer  25: dev = RPC0
llama_kv_cache: layer  26: dev = RPC0
llama_kv_cache: layer  27: dev = RPC0
llama_kv_cache: layer  28: dev = RPC0
llama_kv_cache: layer  29: dev = RPC0
llama_kv_cache: layer  30: dev = RPC0
llama_kv_cache: layer  31: dev = RPC0
llama_kv_cache: layer  32: dev = RPC0
llama_kv_cache: layer  33: dev = RPC0
llama_kv_cache: layer  34: dev = RPC0
llama_kv_cache: layer  35: dev = RPC0
llama_kv_cache: layer  36: dev = RPC0
llama_kv_cache: layer  37: dev = RPC0
llama_kv_cache: layer  38: dev = RPC0
llama_kv_cache: layer  39: dev = RPC0
llama_kv_cache: layer  40: dev = RPC0
llama_kv_cache: layer  41: dev = RPC0
llama_kv_cache: layer  42: dev = RPC0
llama_kv_cache: layer  43: dev = RPC0
llama_kv_cache: layer  44: dev = RPC0
llama_kv_cache: layer  45: dev = RPC0
llama_kv_cache:      CUDA0 KV buffer size =     0.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =     0.00 MiB
llama_kv_cache: RPC0[192.168.1.139:50054] KV buffer size =     0.00 MiB
llama_kv_cache: size = 5888.00 MiB ( 32768 cells,  46 layers,  4/1 seqs), K (f16): 2944.00 MiB, V (f16): 2944.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 4
llama_context: max_nodes = 6240
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 512, n_seqs = 4, n_outputs = 4
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  4, n_outputs =    4
graph_reserve: making n_tokens a multiple of n_seqs - n_tokens = 4, n_seqs = 4, n_outputs = 4
llama_context: Flash Attention was auto, set to enabled
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  4, n_outputs =  512
Segmentation fault (core dumped)
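
If a backtrace would help, I can capture one; a typical approach, assuming gdb is installed and the binary was built with debug info:

gdb --args ./build-rpc-cuda/bin/llama-server --rpc 192.168.1.139:50054 -m /dataset/model/zai-org_GLM-4.6V-Q6_K-00001-of-00003.gguf -ngl 999 -c 32768 -fit on
(gdb) run
# after the SIGSEGV is raised:
(gdb) bt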
