Name and Version
The default option of -fit on (or manually setting fit to on) reports that all model and KV buffers are 0.00 MiB except the CPU output buffer from llama_context, and then crashes. -fit off works fine.
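For reference, a minimal sketch of the two invocations being compared; only the fit flag differs, and "..." stands for the identical model, RPC, device, and context flags given in full under the reproduce steps below:

./build-rpc-cuda/bin/llama-server ... -fit on     (reports 0.00 MiB buffers, then segfaults)
./build-rpc-cuda/bin/llama-server ... -fit off    (loads and serves normally)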
Operating systems
Linux
GGML backends
RPC
Hardware
I am using two RTX 3090s with CUDA on system 1 and one AMD 395+ with Vulkan on system 2.
Models
GLM 4.6V at Q6_K
Problem description & steps to reproduce
./build-rpc-cuda/bin/llama-server --rpc 192.168.1.139:50054 -m /dataset/model/zai-org_GLM-4.6V-Q6_K-00001-of-00003.gguf -ngl 999 -c 32768 --port 5001 --host 0.0.0.0 -ts 23,24,64 --api-key --no-mmap -dev CUDA0,CUDA1,RPC0 --chat-template-kwargs '{"enable_thinking": false}' --mmproj /dataset/model/mmproj-zai-org_GLM-4.6V-f16.gguf -fit on --verbose
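For completeness, the RPC endpoint at 192.168.1.139:50054 is served by llama.cpp's rpc-server tool on system 2 (the Vulkan machine); a sketch of how it was presumably started, where the build directory name is an assumption and only the port matches the command above:

./build-vulkan/bin/rpc-server --host 0.0.0.0 --port 50054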
First Bad Commit
No response
Relevant log output
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 7472 (4d1316c44) with GNU 13.3.0 for Linux x86_64
system info: n_threads = 6, n_threads_batch = 6, total_threads = 20
system_info: n_threads = 6 (n_threads_batch = 6) / 20 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
init: api_keys: ****a
init: using 19 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model '/dataset/model/zai-org_GLM-4.6V-Q6_K-00001-of-00003.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: getting device memory data for initial parameters:
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 22100 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:02:00.0) - 23766 MiB free
llama_model_load_from_file_impl: using device RPC0 (192.168.1.139:50054) (unknown id) - 91878 MiB free
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 56 key-value pairs and 780 tensors from /dataset/model/zai-org_GLM-4.6V-Q6_K-00001-of-00003.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = glm4moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_k i32 = 2
llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.600000
llama_model_loader: - kv 4: general.sampling.temp f32 = 0.800000
llama_model_loader: - kv 5: general.name str = GLM 4.6V
llama_model_loader: - kv 6: general.finetune str = 4.6V
llama_model_loader: - kv 7: general.basename str = GLM
llama_model_loader: - kv 8: general.size_label str = 128x8.0B
llama_model_loader: - kv 9: general.license str = mit
llama_model_loader: - kv 10: general.tags arr[str,1] = ["image-text-to-text"]
llama_model_loader: - kv 11: general.languages arr[str,2] = ["zh", "en"]
llama_model_loader: - kv 12: glm4moe.block_count u32 = 46
llama_model_loader: - kv 13: glm4moe.context_length u32 = 131072
llama_model_loader: - kv 14: glm4moe.embedding_length u32 = 4096
llama_model_loader: - kv 15: glm4moe.feed_forward_length u32 = 10944
llama_model_loader: - kv 16: glm4moe.attention.head_count u32 = 96
llama_model_loader: - kv 17: glm4moe.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: glm4moe.rope.dimension_sections arr[i32,4] = [8, 12, 12, 0]
llama_model_loader: - kv 19: glm4moe.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 20: glm4moe.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 21: glm4moe.expert_used_count u32 = 8
llama_model_loader: - kv 22: glm4moe.expert_group_count u32 = 1
llama_model_loader: - kv 23: glm4moe.expert_group_used_count u32 = 1
llama_model_loader: - kv 24: glm4moe.attention.key_length u32 = 128
llama_model_loader: - kv 25: glm4moe.attention.value_length u32 = 128
llama_model_loader: - kv 26: glm4moe.rope.dimension_count u32 = 64
llama_model_loader: - kv 27: glm4moe.expert_count u32 = 128
llama_model_loader: - kv 28: glm4moe.expert_feed_forward_length u32 = 1408
llama_model_loader: - kv 29: glm4moe.expert_shared_count u32 = 1
llama_model_loader: - kv 30: glm4moe.leading_dense_block_count u32 = 1
llama_model_loader: - kv 31: glm4moe.expert_gating_func u32 = 2
llama_model_loader: - kv 32: glm4moe.expert_weights_scale f32 = 1.000000
llama_model_loader: - kv 33: glm4moe.expert_weights_norm bool = true
llama_model_loader: - kv 34: glm4moe.nextn_predict_layers u32 = 0
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 36: tokenizer.ggml.pre str = glm4
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,151552] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,151552] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,318088] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 151329
llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 151329
llama_model_loader: - kv 42: tokenizer.ggml.bos_token_id u32 = 151331
llama_model_loader: - kv 43: tokenizer.ggml.eot_token_id u32 = 151336
llama_model_loader: - kv 44: tokenizer.ggml.unknown_token_id u32 = 151329
llama_model_loader: - kv 45: tokenizer.ggml.eom_token_id u32 = 151338
llama_model_loader: - kv 46: tokenizer.chat_template str = [gMASK]<sop>\n{%- if tools -%}\n<|syste...
llama_model_loader: - kv 47: general.quantization_version u32 = 2
llama_model_loader: - kv 48: general.file_type u32 = 18
llama_model_loader: - kv 49: quantize.imatrix.file str = /models_out/GLM-4.6V-GGUF/zai-org_GLM...
llama_model_loader: - kv 50: quantize.imatrix.dataset str = /training_dir/calibration_datav5.txt
llama_model_loader: - kv 51: quantize.imatrix.entries_count u32 = 502
llama_model_loader: - kv 52: quantize.imatrix.chunks_count u32 = 802
llama_model_loader: - kv 53: split.no u16 = 0
llama_model_loader: - kv 54: split.tensors.count i32 = 780
llama_model_loader: - kv 55: split.count u16 = 3
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q5_1: 23 tensors
llama_model_loader: - type q8_0: 158 tensors
llama_model_loader: - type q5_K: 46 tensors
llama_model_loader: - type q6_K: 232 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q6_K
print_info: file size = 89.30 GiB (7.18 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 151363 '<|image|>' is not marked as EOG
load: control token: 151362 '<|end_of_box|>' is not marked as EOG
load: control token: 151361 '<|begin_of_box|>' is not marked as EOG
load: control token: 151349 '<|code_suffix|>' is not marked as EOG
load: control token: 151348 '<|code_middle|>' is not marked as EOG
load: control token: 151346 '<|end_of_transcription|>' is not marked as EOG
load: control token: 151343 '<|begin_of_audio|>' is not marked as EOG
load: control token: 151342 '<|end_of_video|>' is not marked as EOG
load: control token: 151341 '<|begin_of_video|>' is not marked as EOG
load: control token: 151338 '<|observation|>' is not marked as EOG
load: control token: 151333 '<sop>' is not marked as EOG
load: control token: 151331 '[gMASK]' is not marked as EOG
load: control token: 151330 '[MASK]' is not marked as EOG
load: control token: 151347 '<|code_prefix|>' is not marked as EOG
load: control token: 151360 '/nothink' is not marked as EOG
load: control token: 151337 '<|assistant|>' is not marked as EOG
load: control token: 151332 '[sMASK]' is not marked as EOG
load: control token: 151334 '<eop>' is not marked as EOG
load: control token: 151335 '<|system|>' is not marked as EOG
load: control token: 151336 '<|user|>' is not marked as EOG
load: control token: 151340 '<|end_of_image|>' is not marked as EOG
load: control token: 151339 '<|begin_of_image|>' is not marked as EOG
load: control token: 151364 '<|video|>' is not marked as EOG
load: control token: 151345 '<|begin_of_transcription|>' is not marked as EOG
load: control token: 151344 '<|end_of_audio|>' is not marked as EOG
load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special_eom_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 151329 ('<|endoftext|>')
load: - 151336 ('<|user|>')
load: - 151338 ('<|observation|>')
load: special tokens cache size = 36
load: token to piece cache size = 0.9713 MB
print_info: arch = glm4moe
print_info: vocab_only = 0
print_info: no_alloc = 1
print_info: n_ctx_train = 131072
print_info: n_embd = 4096
print_info: n_embd_inp = 4096
print_info: n_layer = 46
print_info: n_head = 96
print_info: n_head_kv = 8
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 12
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 10944
print_info: n_expert = 128
print_info: n_expert_used = 8
print_info: n_expert_groups = 1
print_info: n_group_used = 1
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 8
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned = unknown
print_info: mrope sections = [8, 12, 12, 0]
print_info: model type = ?B
print_info: model params = 106.85 B
print_info: general.name = GLM 4.6V
print_info: vocab type = BPE
print_info: n_vocab = 151552
print_info: n_merges = 318088
print_info: BOS token = 151331 '[gMASK]'
print_info: EOS token = 151329 '<|endoftext|>'
print_info: EOT token = 151336 '<|user|>'
print_info: EOM token = 151338 '<|observation|>'
print_info: UNK token = 151329 '<|endoftext|>'
print_info: PAD token = 151329 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151347 '<|code_prefix|>'
print_info: FIM SUF token = 151349 '<|code_suffix|>'
print_info: FIM MID token = 151348 '<|code_middle|>'
print_info: EOG token = 151329 '<|endoftext|>'
print_info: EOG token = 151336 '<|user|>'
print_info: EOG token = 151338 '<|observation|>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: layer 0 assigned to device CUDA0, is_swa = 0
load_tensors: layer 1 assigned to device CUDA0, is_swa = 0
load_tensors: layer 2 assigned to device CUDA0, is_swa = 0
load_tensors: layer 3 assigned to device CUDA0, is_swa = 0
load_tensors: layer 4 assigned to device CUDA0, is_swa = 0
load_tensors: layer 5 assigned to device CUDA0, is_swa = 0
load_tensors: layer 6 assigned to device CUDA0, is_swa = 0
load_tensors: layer 7 assigned to device CUDA0, is_swa = 0
load_tensors: layer 8 assigned to device CUDA0, is_swa = 0
load_tensors: layer 9 assigned to device CUDA0, is_swa = 0
load_tensors: layer 10 assigned to device CUDA1, is_swa = 0
load_tensors: layer 11 assigned to device CUDA1, is_swa = 0
load_tensors: layer 12 assigned to device CUDA1, is_swa = 0
load_tensors: layer 13 assigned to device CUDA1, is_swa = 0
load_tensors: layer 14 assigned to device CUDA1, is_swa = 0
load_tensors: layer 15 assigned to device CUDA1, is_swa = 0
load_tensors: layer 16 assigned to device CUDA1, is_swa = 0
load_tensors: layer 17 assigned to device CUDA1, is_swa = 0
load_tensors: layer 18 assigned to device CUDA1, is_swa = 0
load_tensors: layer 19 assigned to device CUDA1, is_swa = 0
load_tensors: layer 20 assigned to device RPC0, is_swa = 0
load_tensors: layer 21 assigned to device RPC0, is_swa = 0
load_tensors: layer 22 assigned to device RPC0, is_swa = 0
load_tensors: layer 23 assigned to device RPC0, is_swa = 0
load_tensors: layer 24 assigned to device RPC0, is_swa = 0
load_tensors: layer 25 assigned to device RPC0, is_swa = 0
load_tensors: layer 26 assigned to device RPC0, is_swa = 0
load_tensors: layer 27 assigned to device RPC0, is_swa = 0
load_tensors: layer 28 assigned to device RPC0, is_swa = 0
load_tensors: layer 29 assigned to device RPC0, is_swa = 0
load_tensors: layer 30 assigned to device RPC0, is_swa = 0
load_tensors: layer 31 assigned to device RPC0, is_swa = 0
load_tensors: layer 32 assigned to device RPC0, is_swa = 0
load_tensors: layer 33 assigned to device RPC0, is_swa = 0
load_tensors: layer 34 assigned to device RPC0, is_swa = 0
load_tensors: layer 35 assigned to device RPC0, is_swa = 0
load_tensors: layer 36 assigned to device RPC0, is_swa = 0
load_tensors: layer 37 assigned to device RPC0, is_swa = 0
load_tensors: layer 38 assigned to device RPC0, is_swa = 0
load_tensors: layer 39 assigned to device RPC0, is_swa = 0
load_tensors: layer 40 assigned to device RPC0, is_swa = 0
load_tensors: layer 41 assigned to device RPC0, is_swa = 0
load_tensors: layer 42 assigned to device RPC0, is_swa = 0
load_tensors: layer 43 assigned to device RPC0, is_swa = 0
load_tensors: layer 44 assigned to device RPC0, is_swa = 0
load_tensors: layer 45 assigned to device RPC0, is_swa = 0
load_tensors: layer 46 assigned to device RPC0, is_swa = 0
create_tensor: loading tensor token_embd.weight
create_tensor: loading tensor output_norm.weight
create_tensor: loading tensor output.weight
create_tensor: loading tensor blk.0.attn_norm.weight
create_tensor: loading tensor blk.0.attn_q.weight
create_tensor: loading tensor blk.0.attn_k.weight
create_tensor: loading tensor blk.0.attn_v.weight
create_tensor: loading tensor blk.0.attn_q.bias
create_tensor: loading tensor blk.0.attn_k.bias
create_tensor: loading tensor blk.0.attn_v.bias
create_tensor: loading tensor blk.0.attn_output.weight
create_tensor: loading tensor blk.0.post_attention_norm.weight
create_tensor: loading tensor blk.0.ffn_gate.weight
create_tensor: loading tensor blk.0.ffn_down.weight
create_tensor: loading tensor blk.0.ffn_up.weight
create_tensor: loading tensor blk.1.attn_norm.weight
create_tensor: loading tensor blk.1.attn_q.weight
create_tensor: loading tensor blk.1.attn_k.weight
create_tensor: loading tensor blk.1.attn_v.weight
create_tensor: loading tensor blk.1.attn_q.bias
create_tensor: loading tensor blk.1.attn_k.bias
create_tensor: loading tensor blk.1.attn_v.bias
create_tensor: loading tensor blk.1.attn_output.weight
create_tensor: loading tensor blk.1.post_attention_norm.weight
create_tensor: loading tensor blk.1.ffn_gate_inp.weight
create_tensor: loading tensor blk.1.exp_probs_b.bias
create_tensor: loading tensor blk.1.ffn_gate_exps.weight
create_tensor: loading tensor blk.1.ffn_down_exps.weight
create_tensor: loading tensor blk.1.ffn_up_exps.weight
create_tensor: loading tensor blk.1.ffn_gate_shexp.weight
create_tensor: loading tensor blk.1.ffn_down_shexp.weight
create_tensor: loading tensor blk.1.ffn_up_shexp.weight
<snip duplicates>
create_tensor: loading tensor blk.45.attn_norm.weight
create_tensor: loading tensor blk.45.attn_q.weight
create_tensor: loading tensor blk.45.attn_k.weight
create_tensor: loading tensor blk.45.attn_v.weight
create_tensor: loading tensor blk.45.attn_q.bias
create_tensor: loading tensor blk.45.attn_k.bias
create_tensor: loading tensor blk.45.attn_v.bias
create_tensor: loading tensor blk.45.attn_output.weight
create_tensor: loading tensor blk.45.post_attention_norm.weight
create_tensor: loading tensor blk.45.ffn_gate_inp.weight
create_tensor: loading tensor blk.45.exp_probs_b.bias
create_tensor: loading tensor blk.45.ffn_gate_exps.weight
create_tensor: loading tensor blk.45.ffn_down_exps.weight
create_tensor: loading tensor blk.45.ffn_up_exps.weight
create_tensor: loading tensor blk.45.ffn_gate_shexp.weight
create_tensor: loading tensor blk.45.ffn_down_shexp.weight
create_tensor: loading tensor blk.45.ffn_up_shexp.weight
load_tensors: tensor 'token_embd.weight' (q6_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading output layer to GPU
load_tensors: offloading 45 repeating layers to GPU
load_tensors: offloaded 47/47 layers to GPU
load_tensors: CPU model buffer size = 0.00 MiB
load_tensors: CUDA0 model buffer size = 0.00 MiB
load_tensors: CUDA1 model buffer size = 0.00 MiB
load_tensors: RPC0[192.168.1.139:50054] model buffer size = 0.00 MiB
llama_context: constructing llama_context
llama_context: n_seq_max = 4
llama_context: n_ctx = 32768
llama_context: n_ctx_seq = 32768
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (32768) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context: CPU output buffer size = 2.31 MiB
llama_kv_cache: layer 0: dev = CUDA0
llama_kv_cache: layer 1: dev = CUDA0
llama_kv_cache: layer 2: dev = CUDA0
llama_kv_cache: layer 3: dev = CUDA0
llama_kv_cache: layer 4: dev = CUDA0
llama_kv_cache: layer 5: dev = CUDA0
llama_kv_cache: layer 6: dev = CUDA0
llama_kv_cache: layer 7: dev = CUDA0
llama_kv_cache: layer 8: dev = CUDA0
llama_kv_cache: layer 9: dev = CUDA0
llama_kv_cache: layer 10: dev = CUDA1
llama_kv_cache: layer 11: dev = CUDA1
llama_kv_cache: layer 12: dev = CUDA1
llama_kv_cache: layer 13: dev = CUDA1
llama_kv_cache: layer 14: dev = CUDA1
llama_kv_cache: layer 15: dev = CUDA1
llama_kv_cache: layer 16: dev = CUDA1
llama_kv_cache: layer 17: dev = CUDA1
llama_kv_cache: layer 18: dev = CUDA1
llama_kv_cache: layer 19: dev = CUDA1
llama_kv_cache: layer 20: dev = RPC0
llama_kv_cache: layer 21: dev = RPC0
llama_kv_cache: layer 22: dev = RPC0
llama_kv_cache: layer 23: dev = RPC0
llama_kv_cache: layer 24: dev = RPC0
llama_kv_cache: layer 25: dev = RPC0
llama_kv_cache: layer 26: dev = RPC0
llama_kv_cache: layer 27: dev = RPC0
llama_kv_cache: layer 28: dev = RPC0
llama_kv_cache: layer 29: dev = RPC0
llama_kv_cache: layer 30: dev = RPC0
llama_kv_cache: layer 31: dev = RPC0
llama_kv_cache: layer 32: dev = RPC0
llama_kv_cache: layer 33: dev = RPC0
llama_kv_cache: layer 34: dev = RPC0
llama_kv_cache: layer 35: dev = RPC0
llama_kv_cache: layer 36: dev = RPC0
llama_kv_cache: layer 37: dev = RPC0
llama_kv_cache: layer 38: dev = RPC0
llama_kv_cache: layer 39: dev = RPC0
llama_kv_cache: layer 40: dev = RPC0
llama_kv_cache: layer 41: dev = RPC0
llama_kv_cache: layer 42: dev = RPC0
llama_kv_cache: layer 43: dev = RPC0
llama_kv_cache: layer 44: dev = RPC0
llama_kv_cache: layer 45: dev = RPC0
llama_kv_cache: CUDA0 KV buffer size = 0.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 0.00 MiB
llama_kv_cache: RPC0[192.168.1.139:50054] KV buffer size = 0.00 MiB
llama_kv_cache: size = 5888.00 MiB ( 32768 cells, 46 layers, 4/1 seqs), K (f16): 2944.00 MiB, V (f16): 2944.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 4
llama_context: max_nodes = 6240
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 512, n_seqs = 4, n_outputs = 4
graph_reserve: reserving a graph for ubatch with n_tokens = 1, n_seqs = 4, n_outputs = 4
graph_reserve: making n_tokens a multiple of n_seqs - n_tokens = 4, n_seqs = 4, n_outputs = 4
llama_context: Flash Attention was auto, set to enabled
graph_reserve: reserving a graph for ubatch with n_tokens = 512, n_seqs = 4, n_outputs = 512
Segmentation fault (core dumped)
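As a sanity check on the log above: the llama_kv_cache size line is consistent with the printed parameters even though every per-device KV buffer reports 0.00 MiB. Assuming the usual f16 KV layout (cells x layers x n_embd_k_gqa x 2 bytes per element, and the same for V):

32768 cells x 46 layers x 1024 x 2 bytes = 3,087,007,744 bytes = 2944.00 MiB for K, and likewise 2944.00 MiB for V, i.e. the reported 5888.00 MiB total.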