@ConnorLi96
Contributor

@ConnorLi96 ConnorLi96 commented Feb 4, 2026

  • Set explicit output_format=TopKOutputFormat.STANDARD for unquantized GLM4-MoE models
  • Add validation check in triton MoE runner to catch format mismatches early
  • Fixes ValueError when using EAGLE speculative decoding with unquantized GLM4-MoE

Motivation

Fixes ValueError: too many values to unpack (expected 3) when using EAGLE speculative decoding with unquantized GLM4-MoE models. The server crashes immediately after warmup during the first generation request.

File "sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 206, in fused_experts
    topk_weights, topk_ids, _ = topk_output
ValueError: too many values to unpack (expected 3)
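The failure mode in the traceback can be reproduced in isolation. The sketch below is not sglang code: `StandardLike` and `BypassedLike` are hypothetical stand-ins that only mirror the field counts named in this PR (3 and 5), and all field names other than `topk_weights` and `topk_ids` are placeholders.

```python
from typing import Any, NamedTuple


class StandardLike(NamedTuple):
    """Hypothetical 3-field stand-in for StandardTopKOutput."""
    topk_weights: Any
    topk_ids: Any
    extra: Any  # placeholder name for the third field


class BypassedLike(NamedTuple):
    """Hypothetical 5-field stand-in for BypassedTopKOutput."""
    field_1: Any
    field_2: Any
    field_3: Any
    field_4: Any
    field_5: Any


def unpack_like_fused_experts(topk_output):
    # Same unpacking pattern as the line shown in the traceback.
    topk_weights, topk_ids, _ = topk_output
    return topk_weights, topk_ids


# A 3-field output unpacks cleanly.
ok = unpack_like_fused_experts(StandardLike([0.7, 0.3], [1, 4], None))

# A 5-field output fails at the unpacking line with ValueError.
try:
    unpack_like_fused_experts(BypassedLike("a", "b", "c", "d", "e"))
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(ok)
print(error_message)
```

Because the two output types are both tuples, nothing fails until the unpacking statement itself, which is why the crash only shows up on the first generation request.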

Launch command to reproduce:

python3 -m sglang.launch_server \
  --model-path baseten-admin/glm-4.7-fp4 \
  --port 12345 \
  --enable-metrics \
  --tp-size 4 \
  --moe-runner-backend flashinfer_trtllm \
  --attention-backend flashinfer \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --mem-fraction-static 0.8 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --served-model-name baseten-admin/glm-4.7-fp4 \
  --host 0.0.0.0 \
  --context-length 202752 \
  --crash-dump-folder /var/log/sglang/

Hardware: B200

Modifications

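Root Cause: GLM4-MoE's TopK layer doesn't explicitly set output_format, so auto-detection incorrectly produces BypassedTopKOutput (5 fields) instead of StandardTopKOutput (3 fields) when using EAGLE with unquantized models. fused_experts unpacks exactly three values from topk_output, so the 5-field output raises the ValueError shown in the traceback above.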

Fix: Set explicit output_format=TopKOutputFormat.STANDARD for unquantized GLM4-MoE models in glm4_moe.py:

output_format=TopKOutputFormat.STANDARD if quant_config is None else None,
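A minimal sketch of the two changes described in this PR, under stated assumptions: `TopKOutputFormat` here is a local stand-in enum (only the `STANDARD` member is taken from this PR), and `choose_output_format` and `check_standard_format` are hypothetical helper names, not functions in sglang. The real validation check in the triton MoE runner may look different.

```python
from enum import Enum, auto
from typing import Any, Optional, Sequence


class TopKOutputFormat(Enum):
    """Local stand-in for sglang's enum; other members are omitted."""
    STANDARD = auto()


def choose_output_format(quant_config: Optional[Any]) -> Optional[TopKOutputFormat]:
    # Mirrors the one-line fix: force STANDARD for unquantized models,
    # and leave None for quantized models so auto-detection still applies.
    return TopKOutputFormat.STANDARD if quant_config is None else None


def check_standard_format(topk_output: Sequence[Any]) -> None:
    # Hypothetical early validation: fail with a descriptive message
    # instead of an opaque unpacking error deeper in the kernel path.
    if len(topk_output) != 3:
        raise TypeError(
            "expected a 3-field standard top-k output, got "
            f"{type(topk_output).__name__} with {len(topk_output)} fields"
        )


# Unquantized model: the format is pinned explicitly.
unquantized_format = choose_output_format(None)

# Quantized model (any non-None config object): auto-detection is kept.
quantized_format = choose_output_format({"quant_method": "placeholder"})

check_standard_format(("weights", "ids", "logits"))  # passes silently

try:
    check_standard_format(("a", "b", "c", "d", "e"))
    mismatch_message = None
except TypeError as exc:
    mismatch_message = str(exc)

print(unquantized_format, quantized_format)
print(mismatch_message)
```

Keeping `None` for the quantized path leaves existing behavior untouched for backends such as flashinfer_trtllm, which is consistent with the FP4 check listed under Accuracy Tests.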

Accuracy Tests

✅ Tested unquantized GLM4-MoE with EAGLE speculative decoding (no longer crashes)
✅ Verified FP4 quantized models still work correctly with flashinfer_trtllm backend

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@ConnorLi96
Contributor Author

/tag-and-rerun-ci

@github-actions bot added the run-ci label Feb 4, 2026
@ConnorLi96
Contributor Author

cc @zRzRzRzRzRzRzR @JustinTong0323, it seems you've been working on this model recently.
