Support llama3 eagle3 head with llama4 verifier #117

rahul-tuli · 2025-09-24T13:40:02Z

Summary

This PR enables Eagle3 speculative decoding with a Llama3-based drafter and a Llama4 multimodal verifier, and makes the auxiliary hidden state layers configurable.

Key Features

Compatibility: A Llama3 Eagle3 drafter can now work with Llama4 verifier models.
Configurable auxiliary layers: Hidden state layer indices can be specified via eagle_aux_hidden_state_layer_ids in the speculator config, so a non-default layer selection can be used when the drafter and verifier have different architectures.

Configuration

Auxiliary layer indices can be set in the Eagle3 draft model config:

{
  "eagle_aux_hidden_state_layer_ids": [1, 23, 44]
}

This makes it possible to use hidden states from non-default layers (e.g., layers 1, 23, 44 instead of the default 2, 23, 44) in cross-architecture scenarios where a different layer combination may work better.

Testing

Command:

python examples/offline_inference/spec_decode.py \
  --method "eagle3" \
  --tp 8 \
  --print-output \
  --model-dir "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16" \
  --eagle-dir "nm-testing/Llama4-Maverick-Eagle3-Speculators" \
  --dataset_name "hf" \
  --dataset_path "philschmid/mt-bench" \
  --num-spec-tokens 3

Results:

Mean acceptance length: 2.53
Per-position acceptance rates: 0.71, 0.48, 0.34
Auxiliary layers used: [1, 23, 44] (configured via speculator config)

--------------------------------------------------
--------------------------------------------------
total_num_output_tokens: 227215
num_drafts: 90393
num_draft_tokens: 271179
num_accepted_tokens: 136677
mean acceptance length: 2.53
--------------------------------------------------
acceptance at token 0: 0.71
acceptance at token 1: 0.48
acceptance at token 2: 0.34
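
As a sanity check on the figures above, the sketch below recomputes the mean acceptance length in two ways. It assumes the common definition (one token from the verifier per draft, plus the accepted draft tokens); the variable names are illustrative and are not taken from the example script.

```python
# Values copied from the output above.
num_drafts = 90393
num_draft_tokens = 271179
num_accepted_tokens = 136677
per_position_rates = [0.71, 0.48, 0.34]
num_spec_tokens = 3  # from --num-spec-tokens 3

# Every draft proposes num_spec_tokens tokens.
assert num_drafts * num_spec_tokens == num_draft_tokens

# Assumed definition: 1 verifier token per draft + accepted draft tokens.
from_rates = 1 + sum(per_position_rates)
from_counters = 1 + num_accepted_tokens / num_drafts

print(f"from per-position rates: {from_rates:.2f}")
print(f"from raw counters:       {from_counters:.2f}")
```

Under this assumption the per-position rates reproduce the reported 2.53, while the raw counters give about 2.51. The PR does not say where the small gap comes from.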

Support configuring eagle_aux_hidden_state_layer_ids and inference_type in the Eagle3 speculator configuration. This allows users to specify which verifier layers should output auxiliary hidden states for the drafter to consume during speculative decoding.

Signed-off-by: rahul-tuli <[email protected]>
Signed-off-by: Rahul Tuli <[email protected]>

Add documentation explaining that get_eagle3_aux_hidden_state_layers() provides default layer selection and that the GPU model runner can override this with values from speculative config for dynamic configuration.

Signed-off-by: rahul-tuli <[email protected]>
Signed-off-by: Rahul Tuli <[email protected]>

Add Eagle3 support to Llama4ForConditionalGeneration by implementing set_aux_hidden_state_layers() and get_eagle3_aux_hidden_state_layers() methods. Both methods delegate to the underlying Llama4ForCausalLM language model, enabling Eagle3 speculative decoding with Llama4 multimodal verifier models. This allows text-only Eagle3 drafters to work with Llama4 multimodal verifiers by consuming auxiliary hidden states from specified layers.

Signed-off-by: rahul-tuli <[email protected]>
Signed-off-by: Rahul Tuli <[email protected]>
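
The delegation described in this commit can be pictured roughly as follows. This is a minimal sketch, not the code from the PR: both classes are hypothetical stand-ins, only the two method names come from the commit message, and the default layer tuple is simply the (2, 23, 44) example quoted in the Configuration section.

```python
class LanguageModelStub:
    """Hypothetical stand-in for the text-only Llama4ForCausalLM."""

    def __init__(self) -> None:
        self.aux_hidden_state_layers: tuple[int, ...] = ()

    def set_aux_hidden_state_layers(self, layers: tuple[int, ...]) -> None:
        # Remember which layers should emit auxiliary hidden states.
        self.aux_hidden_state_layers = layers

    def get_eagle3_aux_hidden_state_layers(self) -> tuple[int, ...]:
        # Assumed default, taken from the example in the PR description.
        return (2, 23, 44)


class MultimodalWrapperStub:
    """Hypothetical stand-in for Llama4ForConditionalGeneration."""

    def __init__(self, language_model: LanguageModelStub) -> None:
        self.language_model = language_model

    def set_aux_hidden_state_layers(self, layers: tuple[int, ...]) -> None:
        # Delegate to the wrapped language model.
        self.language_model.set_aux_hidden_state_layers(layers)

    def get_eagle3_aux_hidden_state_layers(self) -> tuple[int, ...]:
        # Delegate to the wrapped language model.
        return self.language_model.get_eagle3_aux_hidden_state_layers()


wrapper = MultimodalWrapperStub(LanguageModelStub())
wrapper.set_aux_hidden_state_layers((1, 23, 44))
print(wrapper.language_model.aux_hidden_state_layers)
```

Because the wrapper only forwards calls, any caller that talks to a text-only verifier through these two methods can talk to the multimodal one in the same way.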

Implement custom get_input_embeddings() in Eagle3LlamaForCausalLM that accepts multimodal parameters but only processes text embeddings. This ensures the Llama3-based Eagle3 drafter correctly handles text inputs while remaining compatible with multimodal verifier interfaces. The drafter receives multimodal context through auxiliary hidden states from the verifier rather than processing multimodal inputs directly.

Signed-off-by: rahul-tuli <[email protected]>
Signed-off-by: Rahul Tuli <[email protected]>
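
To illustrate the idea of accepting multimodal arguments while embedding only text, here is a toy sketch. The class is a hypothetical stand-in for Eagle3LlamaForCausalLM, the parameter name multimodal_embeddings is an assumption (the commit message does not give the signature), and plain Python lists replace tensors so the example stays self-contained.

```python
from typing import Any, Optional


class DrafterStub:
    """Hypothetical stand-in for Eagle3LlamaForCausalLM."""

    def __init__(self, embedding_table: dict[int, list[float]]) -> None:
        # Toy embedding table: token id -> embedding vector.
        self.embedding_table = embedding_table

    def get_input_embeddings(
        self,
        input_ids: list[int],
        multimodal_embeddings: Optional[Any] = None,  # assumed parameter name
    ) -> list[list[float]]:
        # The multimodal argument is accepted for interface compatibility
        # and deliberately ignored: only the text tokens are embedded.
        return [self.embedding_table[token_id] for token_id in input_ids]


drafter = DrafterStub({0: [0.0, 0.1], 1: [1.0, 1.1]})
text_only = drafter.get_input_embeddings([1, 0])
with_image = drafter.get_input_embeddings([1, 0], multimodal_embeddings=object())
print(text_only == with_image)
```

The two calls return the same embeddings, which is the behaviour the commit message describes: multimodal context reaches the drafter through the verifier's auxiliary hidden states, not through its own embedding step.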

Implement _get_eagle3_aux_layers_from_config() helper method to extract auxiliary layer IDs from the draft model's speculative config. The GPU model runner now prefers config-specified layers over model defaults, with fallback to model's get_eagle3_aux_hidden_state_layers() when not configured.

Changes:
- Refactor auxiliary layer setup with early return pattern for errors
- Add config extraction with proper error handling
- Log only when using non-default layer configuration
- Enable dynamic layer configuration per deployment

Signed-off-by: rahul-tuli <[email protected]>
Signed-off-by: Rahul Tuli <[email protected]>
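
The preference order described in this commit (config first, model defaults as fallback, logging only for non-default choices) could look roughly like the sketch below. It is not the PR's implementation: the function names are hypothetical, and the draft config is modelled as a plain dict keyed by eagle_aux_hidden_state_layer_ids, whereas the real runner reads it from the speculative config object.

```python
import logging
from typing import Optional, Sequence

logger = logging.getLogger(__name__)


def get_aux_layers_from_config(draft_config: dict) -> Optional[tuple[int, ...]]:
    """Return the configured layer ids, or None if absent or malformed."""
    layer_ids = draft_config.get("eagle_aux_hidden_state_layer_ids")
    if not isinstance(layer_ids, (list, tuple)) or not layer_ids:
        return None  # early return: nothing usable in the config
    return tuple(layer_ids)


def resolve_aux_layers(
    draft_config: dict, model_defaults: Sequence[int]
) -> tuple[int, ...]:
    """Prefer config-specified layers; fall back to the model's defaults."""
    defaults = tuple(model_defaults)
    configured = get_aux_layers_from_config(draft_config)
    if configured is None:
        return defaults
    if configured != defaults:
        # Log only when the configuration deviates from the defaults.
        logger.info("Using non-default Eagle3 aux layers: %s", configured)
    return configured


print(resolve_aux_layers({"eagle_aux_hidden_state_layer_ids": [1, 23, 44]}, (2, 23, 44)))
print(resolve_aux_layers({}, (2, 23, 44)))
```

With the values from the Configuration section, the first call yields the configured layers (1, 23, 44) and the second falls back to (2, 23, 44).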

Signed-off-by: Rahul Tuli <[email protected]>

rahul-tuli force-pushed the support-llama3-eagle3-head-with-llama4-verifier branch 10 times, most recently from cf02c8d to 1695608 Compare September 30, 2025 15:30

rahul-tuli force-pushed the support-llama3-eagle3-head-with-llama4-verifier branch 3 times, most recently from 1f6fd40 to 5e93541 Compare October 3, 2025 08:45

rahul-tuli and others added 7 commits October 6, 2025 13:06

Review comments

1c1d679

Signed-off-by: Rahul Tuli <[email protected]>

Use get_input_embeddings

1037b36

Signed-off-by: Rahul Tuli <[email protected]>

rahul-tuli force-pushed the support-llama3-eagle3-head-with-llama4-verifier branch from cac1941 to 1037b36 Compare October 6, 2025 13:10

mgoin force-pushed the main branch 2 times, most recently from 0340f45 to 2f7dbc9 Compare October 6, 2025 13:35
