## Summary

Loading `google/gemma-4-E4B-it` (or `mlx-community/gemma-4-e4b-it-bf16`) crashes with `ValueError: Received 54 parameters not in model: language_model.model.layers.24.self_attn.k_norm.weight, …` for every `language_model.model.layers.{24..41}.self_attn.{k_norm,k_proj,v_proj}.weight`.

54 = 18 layers × 3 weight names. The 18 matches `text_config.num_kv_shared_layers: 18` in the HF config: these are KV-sharing layers that should reuse the K/V projections of an earlier layer rather than carry their own.
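The arithmetic can be checked directly. Assuming the KV-sharing layers are the last `num_kv_shared_layers` layers (which matches the rejected index range 24..41):

```python
# Values copied from the model's text_config (quoted below under Environment).
num_hidden_layers = 42
num_kv_shared_layers = 18

# If the KV-sharing layers are the trailing block, the affected range is 24..41.
first_shared = num_hidden_layers - num_kv_shared_layers
rejected = [
    f"language_model.model.layers.{i}.self_attn.{name}.weight"
    for i in range(first_shared, num_hidden_layers)
    for name in ("k_norm", "k_proj", "v_proj")
]
print(first_shared, len(rejected))  # -> 24 54
```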
The HF safetensors still ship those K/V/`k_norm` tensors (apparently for round-trip safety), but vllm-metal's Gemma4 class, which has KV-sharing wired up, has no weight attributes for those slots, so `mlx.nn.layers.base.load_weights(strict=True)` rejects them.

The recently merged multimodal/text-backbone fix is applying correctly (`Metal: forcing text-only backbone for model_type=gemma4 (multimodal_mode=auto, cleared multimodal_config)`); the failure is unrelated.
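One possible fix on the vllm-metal side is to drop the unused shared-layer K/V tensors before handing the checkpoint to strict loading. A sketch only: the helper name is mine, and it assumes filtering by layer index is safe because the shared layers never read these tensors.

```python
import re

# Hypothetical helper (not vllm-metal's actual code): drop K/V tensors
# belonging to KV-sharing layers so strict weight loading accepts the
# checkpoint. first_shared_layer would come from
# num_hidden_layers - num_kv_shared_layers (42 - 18 = 24 here).
_KV_WEIGHT = re.compile(
    r"language_model\.model\.layers\.(\d+)\.self_attn\."
    r"(?:k_norm|k_proj|v_proj)\.weight$"
)

def drop_shared_kv_weights(weights, first_shared_layer=24):
    kept = {}
    for name, tensor in weights.items():
        m = _KV_WEIGHT.match(name)
        if m and int(m.group(1)) >= first_shared_layer:
            continue  # unused by the KV-sharing layer; safe to skip
        kept[name] = tensor
    return kept
```

Loading with strict checking disabled would also mask the error, but silently ignoring every unknown tensor is riskier than an explicit, pattern-scoped filter like this.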
## Environment

- vllm-metal: `0.2.0` (release tag `v0.2.0-20260424-074018`, commit `acd70f84`, installed via `install.sh` from `main`)
- mlx-lm: `0.31.3`
- macOS 25.4.0, Apple Silicon (M-series, 64 GB unified memory)
- Model: `google/gemma-4-E4B-it` (also reproduced on `mlx-community/gemma-4-e4b-it-bf16`)
- Architecture (per `config.json` → `text_config`): `num_hidden_layers: 42`, `num_kv_shared_layers: 18`, `head_dim: 256`, `global_head_dim: 512`
## Repro

```bash
# Install latest vllm-metal
curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm-metal/main/install.sh | bash
source ~/.venv-vllm-metal/bin/activate

# Get the model
hf download google/gemma-4-E4B-it --local-dir /path/to/Gemma-4-E4B-it

# Serve
vllm serve /path/to/Gemma-4-E4B-it --port 8004
```
## Verbatim error log

```
(APIServer pid=11396) INFO 04-25 01:50:24 [model.py:549] Resolved architecture: Gemma4ForConditionalGeneration
(APIServer pid=11396) INFO 04-25 01:50:24 [model.py:1678] Using max model len 4096
(APIServer pid=11396) INFO 04-25 01:50:24 [config.py:104] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
(APIServer pid=11396) INFO 04-25 01:50:24 [model_adapter.py:156] Metal: forcing text-only backbone for model_type=gemma4 (multimodal_mode=auto, cleared multimodal_config)
(EngineCore pid=11417) INFO 04-25 01:50:30 [model_lifecycle.py:168] Loading model: /path/to/Gemma-4-E4B-it (VLM: False)
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108] EngineCore failed to start.
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108] Traceback (most recent call last):
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108]   File "/Users/<user>/.venv-vllm-metal/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108]   File "/Users/<user>/.venv-vllm-metal/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108]     super().__init__(...)
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108]   File "/Users/<user>/.venv-vllm-metal/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 114, in __init__
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108]     self.model_executor = executor_class(vllm_config)
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108]   File "/Users/<user>/.venv-vllm-metal/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108]     self._init_executor()
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108]   File "/Users/<user>/.venv-vllm-metal/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 52, in _init_executor
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108]     self.driver_worker.load_model()
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108]   File "/Users/<user>/.venv-vllm-metal/lib/python3.12/site-packages/vllm_metal/v1/worker.py", line 141, in load_model
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108]     self.model_runner.load_model()
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108]   File "/Users/<user>/.venv-vllm-metal/lib/python3.12/site-packages/vllm_metal/v1/model_runner.py", line 351, in load_model
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108]     self._model_lifecycle.load()
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108]   File "/Users/<user>/.venv-vllm-metal/lib/python3.12/site-packages/vllm_metal/v1/model_lifecycle.py", line 149, in load
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108]     model, tokenizer = self._load_generation_model(model_name, is_vlm)
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108]   File "/Users/<user>/.venv-vllm-metal/lib/python3.12/site-packages/vllm_metal/v1/model_lifecycle.py", line 191, in _load_generation_model
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108]     model, tokenizer = mlx_lm_load(...)
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108]   File "/Users/<user>/.venv-vllm-metal/lib/python3.12/site-packages/mlx_lm/utils.py", line 491, in load
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108]     model, config = load_model(model_path, lazy, model_config=model_config)
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108]   File "/Users/<user>/.venv-vllm-metal/lib/python3.12/site-packages/mlx_lm/utils.py", line 415, in load_model
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108]     model.load_weights(list(weights.items()), strict=strict)
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108]   File "/Users/<user>/.venv-vllm-metal/lib/python3.12/site-packages/mlx/nn/layers/base.py", line 185, in load_weights
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108]     raise ValueError(
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108] ValueError: Received 54 parameters not in model:
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108] language_model.model.layers.24.self_attn.k_norm.weight,
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108] language_model.model.layers.24.self_attn.k_proj.weight,
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108] language_model.model.layers.24.self_attn.v_proj.weight,
[... continues for layers 25 through 41 — full list available, omitted for brevity ...]
(EngineCore pid=11417) ERROR 04-25 01:50:30 [core.py:1108] language_model.model.layers.41.self_attn.v_proj.weight.
```
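For context, a strict weight load fails whenever the checkpoint contains keys the module never declared as parameters. A toy illustration of that check (deliberately simplified, not mlx's actual implementation):

```python
def check_strict(module_params, checkpoint):
    """Toy version of a strict weight-loading check (illustrative only)."""
    extra = sorted(k for k in checkpoint if k not in module_params)
    if extra:
        raise ValueError(
            f"Received {len(extra)} parameters not in model: " + ", ".join(extra)
        )
    module_params.update(checkpoint)

# A module that declares K/V weights only for non-shared layers rejects
# the checkpoint's extra shared-layer tensors:
params = {"layers.0.self_attn.k_proj.weight": 0}
ckpt = {
    "layers.0.self_attn.k_proj.weight": 0,
    "layers.24.self_attn.k_proj.weight": 0,  # shared layer, never declared
}
try:
    check_strict(params, ckpt)
except ValueError as e:
    print(e)  # -> Received 1 parameters not in model: layers.24.self_attn.k_proj.weight
```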