fix: MLLM vision models hallucinate and ignore instructions in BatchedEngine #54
Open
janhilgard wants to merge 1 commit into waybarrios:main from …
SimpleEngine MLLM paths and mllm.py chat()/stream_chat() methods were missing top_p forwarding, causing generation to always use the default value instead of the user-specified one. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Problem
When running vision models via `BatchedEngine` with `--continuous-batching`, the model would hallucinate and ignore instructions:

- … `SimpleEngine` mode

Root Causes & Fixes
1. Chat template dropped system messages and multi-turn context

`_apply_chat_template()` used `mlx_vlm.apply_chat_template()`, which extracted only the last user message's text. All system prompts, formatting instructions, and conversation history were lost.

Fix: Use `tokenizer.apply_chat_template()` with the full message structure, preserving system messages, multi-turn history, and image placeholders.

2. Qwen3 EOS token fix missing for MLLM path
The `eos_token = "<|im_end|>"` fix existed in `_start_llm()` but was absent from `_start_mllm()`. Without the correct EOS token, the model wouldn't stop generating.

Fix: Apply the same Qwen3 EOS token fix in `_start_mllm()`.

3. `top_p` not forwarded in SimpleEngine MLLM path

`chat()` and `stream_chat()` in `MLXMultimodalLM` lacked a `top_p` parameter, and `SimpleEngine` didn't pass it for MLLM branches.

Fix: Add a `top_p` parameter to the MLLM `chat()`/`stream_chat()` signatures and forward it through to `mlx_vlm.generate()`/`stream_generate()`.

4. BatchedEngine KV cache not populated during vision encoding (critical)
`_run_vision_encoding()` called `self.model(input_ids, **kwargs)` without a `cache` parameter. The KV states from the VLM forward pass were discarded. Subsequent generation used an empty `BatchKVCache`, so the model had no context from the prompt or image, causing pure hallucination.

Additionally, `input_ids` from `prepare_inputs` had shape `(1, N)`, and `tolist()` on it returned `[[...]]`, making `len(ids) == 1` instead of `N`, which broke padding calculations.

Fix:

- … `KVCache` and pass it to `self.model(input_ids, cache=per_cache, **kwargs)`
- … `BatchKVCache.merge()`
- … `input_ids` to 1D for correct length computation

Files changed
- `vllm_mlx/engine/batched.py`
- `vllm_mlx/engine/simple.py`: … `top_p` in MLLM chat/stream branches
- `vllm_mlx/models/mllm.py`: … `top_p` to `chat()` and `stream_chat()`
- `vllm_mlx/mllm_batch_generator.py`

Test plan
- `pytest tests/test_mllm.py`: 15/15 passed
- `uvx black`: all files formatted
- … `{"color": "red"}`
- … `finish_reason: "stop"` (was `"length"`)

🤖 Generated with Claude Code
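
Note on root cause 4: the length bug can be reproduced without MLX. The sketch below is only an illustration, using plain nested Python lists in place of the `(1, N)` array that `prepare_inputs` returns. The helpers `flatten_ids` and `pad_amount` are hypothetical names chosen for this example and are not taken from the PR's code.

```python
# Plain nested lists stand in for the (1, N) array returned by prepare_inputs.
# flatten_ids and pad_amount are hypothetical helpers, not code from this PR.


def flatten_ids(ids):
    """Return a flat token list whether ids is [[...]] or already [...]."""
    if ids and isinstance(ids[0], list):
        return [tok for row in ids for tok in row]
    return list(ids)


def pad_amount(ids, target_len):
    """Number of pad tokens needed to bring ids up to target_len."""
    return max(target_len - len(ids), 0)


# What tolist() yields for a (1, 5) array: one outer row holding five tokens.
batched = [[101, 7, 8, 9, 102]]
target = 8

# Bug: len() sees the single outer row, so the prompt looks one token long.
wrong = pad_amount(batched, target)

# Fix: flatten to 1D first so len() reflects the real token count.
right = pad_amount(flatten_ids(batched), target)

print(f"padding without flatten: {wrong}, with flatten: {right}")
```

With a five-token prompt and a target length of eight, the unflattened path asks for seven pad tokens instead of three, which is the kind of miscalculation the PR describes.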