[Bug] Fix GLM4-MoE TopK output format mismatch for unquantized models #18206
Motivation
Fixes `ValueError: too many values to unpack (expected 3)` when using EAGLE speculative decoding with unquantized GLM4-MoE models. The server crashes immediately after warmup, on the first generation request.
Launch command to reproduce:
Hardware: B200
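For context, the crash is a plain Python tuple-unpacking failure: the call site expects the 3-field standard TopK output but receives the 5-field bypassed variant. A minimal standalone illustration (not SGLang code; the two trailing field names are placeholders):

```python
# Standalone illustration of the failure mode (not SGLang code):
# a call site that unpacks 3 fields receives a 5-field tuple instead.
five_field_output = ("topk_weights", "topk_ids", "router_logits",
                     "extra_field_1", "extra_field_2")  # placeholder names
try:
    topk_weights, topk_ids, router_logits = five_field_output
except ValueError as e:
    print(e)  # too many values to unpack (expected 3)
```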
Modifications
Root Cause: GLM4-MoE's TopK layer doesn't explicitly set `output_format`, so auto-detection incorrectly produces `BypassedTopKOutput` (5 fields) instead of `StandardTopKOutput` (3 fields) when using EAGLE with unquantized models.
Fix: Set an explicit `output_format=TopKOutputFormat.STANDARD` for unquantized GLM4-MoE models in `glm4_moe.py`.
Accuracy Tests
✅ Tested unquantized GLM4-MoE with EAGLE speculative decoding (no longer crashes)
✅ Verified FP4 quantized models still work correctly with flashinfer_trtllm backend
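The fix described under Modifications amounts to passing an explicit format instead of relying on auto-detection. A hedged sketch of the pattern (the class and enum names come from the PR description; the stub constructor below is illustrative, not sglang's actual signature):

```python
# Sketch of the fix pattern. TopKOutputFormat and TopK mirror names from
# the PR description; this stub constructor is hypothetical.
from enum import Enum, auto

class TopKOutputFormat(Enum):
    STANDARD = auto()  # 3-field output: weights, ids, router logits
    BYPASSED = auto()  # extended 5-field output used by some quantized paths

class TopK:
    def __init__(self, top_k, output_format=None):
        self.top_k = top_k
        # Before the fix: output_format=None fell through to auto-detection,
        # which could wrongly pick the bypassed format for unquantized
        # EAGLE runs (modeled here as a bad default).
        self.output_format = output_format or TopKOutputFormat.BYPASSED

# After the fix: GLM4-MoE constructs its TopK with an explicit format,
# so auto-detection never runs for unquantized models.
topk = TopK(top_k=8, output_format=TopKOutputFormat.STANDARD)
assert topk.output_format is TopKOutputFormat.STANDARD
```

Pinning the format at construction time keeps the quantized paths (which rely on the bypassed output) untouched while making the unquantized path deterministic.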
Benchmarking and Profiling
Checklist
Review Process
`/tag-run-ci-label`, `/rerun-failed-ci`, `/tag-and-rerun-ci`