fix: MLLM vision models hallucinate and ignore instructions in BatchedEngine by janhilgard · Pull Request #54 · waybarrios/vllm-mlx

janhilgard · 2026-02-09T16:22:42Z

Summary

Fix MLLM models (Qwen3-VL, etc.) ignoring system messages, hallucinating random content, and not stopping at EOS in BatchedEngine (continuous batching) mode
Four root causes identified and fixed across chat template, EOS handling, sampling, and KV cache propagation

Problem

When running vision models via BatchedEngine with --continuous-batching, the model would:

Ignore system messages ("Return only JSON") and generate random code/text
Not stop generating (missing EOS token detection)
Hallucinate unrelated content instead of describing the image
Work correctly only in SimpleEngine mode

Root Causes & Fixes

1. Chat template dropped system messages and multi-turn context

_apply_chat_template() used mlx_vlm.apply_chat_template(), which extracted only the text of the last user message. System prompts, formatting instructions, and conversation history were all lost.

Fix: Use tokenizer.apply_chat_template() with the full message structure, preserving system messages, multi-turn history, and image placeholders.
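
The difference between the two approaches can be sketched roughly as follows. This is an illustration, not the code from this PR: `last_user_text` and `render_prompt` are hypothetical helper names, and `FakeTokenizer` is a stand-in that only mimics the `apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` call shape of a Hugging Face tokenizer. Real multimodal messages would also carry image placeholders, which are left out here for brevity.

```python
def last_user_text(messages):
    """Lossy behaviour described above: keep only the last user message's text."""
    for msg in reversed(messages):
        if msg["role"] == "user":
            return msg["content"]
    return ""


def render_prompt(tokenizer, messages):
    """Render the whole conversation so system and history turns survive."""
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )


class FakeTokenizer:
    """Stand-in for a real tokenizer; emits a ChatML-like layout for the demo."""

    def apply_chat_template(
        self, messages, tokenize=False, add_generation_prompt=False
    ):
        parts = [
            f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
        ]
        if add_generation_prompt:
            parts.append("<|im_start|>assistant\n")
        return "".join(parts)


messages = [
    {"role": "system", "content": "Return only JSON"},
    {"role": "user", "content": "What color is the image?"},
]

lossy = last_user_text(messages)  # system instruction is gone
full = render_prompt(FakeTokenizer(), messages)  # system instruction is kept
```

With the lossy path the model never sees "Return only JSON", which matches the symptom of ignored system messages.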

2. Qwen3 EOS token fix missing for MLLM path

The eos_token = "<|im_end|>" fix existed in _start_llm() but was absent from _start_mllm(). Without the correct EOS token, the model wouldn't stop generating.

Fix: Apply the same Qwen3 EOS token fix in _start_mllm().
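
One way to keep the two startup paths from drifting apart again is to share the override in a single helper. The sketch below is an assumption about how that could look, not the PR's implementation: `apply_qwen3_eos_fix`, `start_llm`, and `start_mllm` are hypothetical names, detecting the model family by a `"qwen3"` substring is a guess, and the `"<|endoftext|>"` starting value is only a placeholder.

```python
from types import SimpleNamespace

QWEN3_EOS = "<|im_end|>"


def apply_qwen3_eos_fix(tokenizer, model_name):
    """Point eos_token at <|im_end|> for Qwen3-family models (hypothetical helper)."""
    if "qwen3" in model_name.lower():
        tokenizer.eos_token = QWEN3_EOS
    return tokenizer


def start_llm(tokenizer, model_name):
    # Text-only path: already had the override before this PR.
    return apply_qwen3_eos_fix(tokenizer, model_name)


def start_mllm(tokenizer, model_name):
    # Multimodal path: the same override, which was previously missing.
    return apply_qwen3_eos_fix(tokenizer, model_name)


# Placeholder tokenizer object; a real one would come from the model loader.
tok = SimpleNamespace(eos_token="<|endoftext|>")
start_mllm(tok, "Qwen3-VL")

other = SimpleNamespace(eos_token="</s>")
start_mllm(other, "some-other-model")  # non-Qwen3 models are left untouched
```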

3. `top_p` not forwarded in SimpleEngine MLLM path

chat() and stream_chat() in MLXMultimodalLM lacked a top_p parameter, and SimpleEngine didn't pass it for MLLM branches.

Fix: Add top_p parameter to MLLM chat()/stream_chat() signatures and forward it through to mlx_vlm.generate()/stream_generate().
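
The forwarding problem can be reproduced with a toy wrapper. Everything here is a stand-in: `fake_generate` only records its sampling arguments and does not reflect the real mlx_vlm signature, and `ToyMultimodalLM` with its default values is hypothetical.

```python
def fake_generate(prompt, temperature=0.0, top_p=1.0, **kwargs):
    """Stand-in for the backend generate call; records the sampling arguments."""
    return {"prompt": prompt, "temperature": temperature, "top_p": top_p}


class ToyMultimodalLM:
    def chat_without_top_p(self, prompt, temperature=0.7, **kwargs):
        # top_p lands in **kwargs and is never forwarded,
        # so the backend default always wins.
        return fake_generate(prompt, temperature=temperature)

    def chat(self, prompt, temperature=0.7, top_p=0.9, **kwargs):
        # top_p is an explicit parameter and is passed through.
        return fake_generate(prompt, temperature=temperature, top_p=top_p, **kwargs)


model = ToyMultimodalLM()
before = model.chat_without_top_p("describe the image", top_p=0.5)
after = model.chat("describe the image", top_p=0.5)
```

In the first call the caller's `top_p=0.5` is silently dropped; in the second it reaches the backend.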

4. BatchedEngine KV cache not populated during vision encoding (critical)

_run_vision_encoding() called self.model(input_ids, **kwargs) without a cache parameter, so the KV states from the VLM forward pass were discarded. Subsequent generation used an empty BatchKVCache, leaving the model with no context from the prompt or image, which caused pure hallucination.

Additionally, input_ids from prepare_inputs had shape (1, N), and tolist() on it returned [[...]], making len(ids) == 1 instead of N, which broke padding calculations.
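
The length bug is easy to reproduce with plain nested lists, which is what `tolist()` yields for a `(1, N)` array. In this sketch `flatten_ids` and `padding_needed` are hypothetical helpers; how the real generator applies the padding (left or right) is not stated here, so only the amount is computed.

```python
def flatten_ids(ids):
    """Collapse a (1, N) nested list to a flat list of N token ids."""
    if len(ids) == 1 and isinstance(ids[0], list):
        return ids[0]
    return ids


def padding_needed(lengths):
    """Padding each sequence needs to reach the longest one in the batch."""
    longest = max(lengths)
    return [longest - n for n in lengths]


request_a = [[11, 12, 13, 14, 15]]  # shape (1, 5) after tolist()
request_b = [[21, 22, 23]]  # shape (1, 3) after tolist()

# Bug: len() sees the outer batch dimension, so every prompt looks 1 token long.
wrong_lengths = [len(request_a), len(request_b)]

# Fix: drop the batch dimension before measuring.
right_lengths = [len(flatten_ids(request_a)), len(flatten_ids(request_b))]

wrong_padding = padding_needed(wrong_lengths)
right_padding = padding_needed(right_lengths)
```

With the wrong lengths no request appears to need padding, even though the second prompt is two tokens shorter than the first.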

Fix:

Create per-request KVCache and pass it to self.model(input_ids, cache=per_cache, **kwargs)
After vision encoding all requests, merge per-request caches via BatchKVCache.merge()
Squeeze 2D input_ids to 1D for correct length computation
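
The cache part of the fix can be illustrated with a toy model. None of this is the mlx-lm API: `ToyKVCache`, `toy_forward`, and `merge_caches` are hypothetical stand-ins that only mimic the idea of filling one cache per request and then combining them, and left-padding in the merge step is an assumption.

```python
class ToyKVCache:
    """Toy per-request cache: stores one 'state' per processed token."""

    def __init__(self):
        self.states = []

    @property
    def offset(self):
        return len(self.states)


def toy_forward(input_ids, cache=None):
    """Stand-in for a model forward pass; states survive only if a cache is given."""
    states = [f"kv({t})" for t in input_ids]
    if cache is not None:
        cache.states.extend(states)
    return states[-1]


def merge_caches(caches, pad="<pad>"):
    """Combine per-request caches into equal-length rows (left-padded here)."""
    longest = max(c.offset for c in caches)
    return [[pad] * (longest - c.offset) + c.states for c in caches]


prompts = [[1, 2, 3, 4], [5, 6]]

# Buggy path: the cache exists but is never passed, so it stays empty.
discarded = ToyKVCache()
toy_forward(prompts[0])

# Fixed path: one cache per request, filled during encoding, merged afterwards.
per_request = []
for ids in prompts:
    cache = ToyKVCache()
    toy_forward(ids, cache=cache)
    per_request.append(cache)

batch = merge_caches(per_request)
```

Generation that starts from `discarded` has no prompt or image context at all, whereas each row of `batch` carries the full encoded prompt for its request.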

Files changed

| File | Changes |
| --- | --- |
| `vllm_mlx/engine/batched.py` | Chat template rewrite + Qwen3 EOS fix for MLLM |
| `vllm_mlx/engine/simple.py` | Forward `top_p` in MLLM chat/stream branches |
| `vllm_mlx/models/mllm.py` | Add `top_p` to `chat()` and `stream_chat()` |
| `vllm_mlx/mllm_batch_generator.py` | KV cache propagation + input_ids shape fix |

Test plan

pytest tests/test_mllm.py — 15/15 passed
uvx black — all files formatted
BatchedEngine: red solid image → "red" (was random hallucination)
BatchedEngine: blue solid image → "blue"
BatchedEngine: system message "Return JSON" → {"color": "red"}
BatchedEngine: OCR image with "Hello World" → "Hello World"
SimpleEngine: same tests pass
EOS: finish_reason: "stop" (was "length")

🤖 Generated with Claude Code

SimpleEngine MLLM paths and mllm.py chat()/stream_chat() methods were missing top_p forwarding, causing generation to always use the default value instead of the user-specified one.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

janhilgard force-pushed the fix/mllm-vision-hallucination branch from 7047ed6 to 7eb93a3 on February 13, 2026 at 08:43

janhilgard force-pushed the fix/mllm-vision-hallucination branch from 7eb93a3 to ef1f3e9 on February 13, 2026 at 08:44

janhilgard wants to merge 1 commit into waybarrios:main from janhilgard:fix/mllm-vision-hallucination