
Conversation

@yuan-luo (Collaborator) commented Nov 19, 2025

Motivation

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Summary by Sourcery

Add support for piecewise CUDA graph execution for multimodal Qwen2.5-VL (and related QwenVL models) by externalizing multimodal preprocessing and extending graph runners to handle embeddings directly.
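The sketch below is a standalone toy (not SGLang code) showing the general shape of that externalization: vision embeddings are merged into the text embeddings eagerly, and only the resulting `input_embeds` are handed to the language-model forward, which is the part the graph runners can capture. All class and function names here are illustrative stand-ins.

```python
# Toy illustration of "external" multimodal preprocessing (assumed names, not SGLang API).
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class ToyForwardBatch:
    input_ids: torch.Tensor
    input_embeds: Optional[torch.Tensor] = None


def toy_multimodal_preprocess(
    batch: ToyForwardBatch,
    embed_tokens: torch.nn.Embedding,
    image_embeds: Optional[torch.Tensor],
    image_mask: Optional[torch.Tensor],
) -> ToyForwardBatch:
    """Eager step: embed text tokens and scatter image embeddings into place."""
    embeds = embed_tokens(batch.input_ids)
    if image_embeds is not None and image_mask is not None:
        embeds = embeds.clone()
        embeds[image_mask] = image_embeds.to(embeds.dtype)
    batch.input_embeds = embeds
    return batch


if __name__ == "__main__":
    vocab, hidden = 32, 8
    embed_tokens = torch.nn.Embedding(vocab, hidden)
    lm = torch.nn.Linear(hidden, hidden)  # stand-in for the transformer stack
    batch = ToyForwardBatch(input_ids=torch.randint(0, vocab, (6,)))
    image_mask = torch.tensor([False, True, True, False, False, False])
    batch = toy_multimodal_preprocess(batch, embed_tokens, torch.randn(2, hidden), image_mask)
    # Graph-capturable step: consumes embeddings only, never raw pixels.
    hidden_states = lm(batch.input_embeds)
    print(hidden_states.shape)  # torch.Size([6, 8])
```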

New Features:

  • Introduce multimodal_preprocess_routine with resolve_language_model and should_use_external_mm_preprocess for external piecewise CUDA graph preprocessing.
  • Extend piecewise_cuda_graph_runner and cuda_graph_runner to use input_embeds and input_deepstack_embeds for multimodal inputs.
  • Update Qwen2.5-VL, Qwen2-VL, and Qwen3-VL model classes to accept input_embeds and deepstack embeddings and delegate to new post_process logic.

Enhancements:

  • Refactor embed_mm_inputs and general_mm_embed_routine to remove placeholder_tokens, consolidate deepstack handling into post_process, and return forward_batch.
  • Adjust rotary embedding layer to compute sequence length dynamically and remove static num_tokens dependency.
  • Modify model_runner forward_extend to invoke multimodal_preprocess_routine for multimodal models.

Tests:

  • Add TestPiecewiseCudaGraphQwen25VL and expand nightly integration test suite for VLM piecewise CUDA graph and MMMU benchmark evaluations.
  • Increase timeout and test file size in run_suite to include new tests.


sourcery-ai bot commented Nov 19, 2025

Reviewer's Guide

This PR refactors the multimodal embedding pipeline and extends both piecewise and full CUDA graph runners to support Qwen2.5-VL by introducing external preprocessing of multimodal inputs, passing precomputed input_embeds (and deepstack embeddings) through graph capture and replay, updating model wrappers to consume these tensors, and adding targeted tests for the new flow.
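To make "passing precomputed input_embeds through graph capture and replay" concrete, here is a minimal, generic capture/replay pattern using plain `torch.cuda.CUDAGraph` (illustrative only, not the SGLang runner; it assumes a CUDA device): the graph is captured reading a static `input_embeds` buffer instead of token IDs, and replay only copies freshly computed embeddings into that buffer.

```python
import torch

device = "cuda"
hidden, max_tokens = 64, 128
lm = torch.nn.Linear(hidden, hidden, device=device)  # stand-in for the LM stack

static_input_embeds = torch.zeros(max_tokens, hidden, device=device)
static_out = torch.empty(max_tokens, hidden, device=device)

# Warm up on a side stream before capture, per the standard CUDA graph recipe.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    lm(static_input_embeds)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.no_grad(), torch.cuda.graph(graph):
    static_out.copy_(lm(static_input_embeds))


def replay(new_embeds: torch.Tensor) -> torch.Tensor:
    """Replay-prepare: fill the static buffer (zero-padding the tail), then replay."""
    n = new_embeds.shape[0]
    static_input_embeds[:n].copy_(new_embeds)
    static_input_embeds[n:].zero_()
    graph.replay()
    return static_out[:n]


out = replay(torch.randn(10, hidden, device=device))
print(out.shape)  # torch.Size([10, 64])
```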

Sequence diagram for external multimodal preprocessing and input_embeds flow

sequenceDiagram
  participant ModelRunner
  participant MMPreprocess as multimodal_preprocess_routine()
  participant EmbedMM as embed_mm_inputs()
  participant Model
  participant CUDA_Graph_Runner

  ModelRunner->>MMPreprocess: If is_multimodal, preprocess batch
  MMPreprocess->>EmbedMM: Compute input_embeds, deepstack_embeds
  EmbedMM-->>MMPreprocess: Return input_embeds, update forward_batch
  MMPreprocess-->>ModelRunner: Return updated forward_batch
  ModelRunner->>CUDA_Graph_Runner: Pass forward_batch with input_embeds
  CUDA_Graph_Runner->>Model: Call forward(input_embeds, ...)
  Model-->>CUDA_Graph_Runner: Return hidden_states

Class diagram for updated ForwardBatch and model wrappers

classDiagram
  class ForwardBatch {
    +torch.Tensor input_ids
    +torch.Tensor input_embeds
    +torch.Tensor mrope_positions
    +Optional[torch.Tensor] input_deepstack_embeds
    ...
  }

  class Qwen2VLForConditionalGeneration {
    +forward(input_ids, positions, forward_batch, input_embeds=None, get_embedding=False)
  }

  class Qwen2_5_VLForConditionalGeneration {
    +forward(input_ids, positions, forward_batch, input_embeds=None, get_embedding=False, pp_proxy_tensors=None)
    +post_process(inputs_embeds, modalities, embeddings, indices, forward_batch)
  }

  class Qwen3VLForConditionalGeneration {
    +forward(input_ids, positions, forward_batch, input_embeds=None, get_embedding=False)
    +post_process(inputs_embeds, modalities, embeddings, indices, forward_batch)
  }

  class Qwen3OmniMoeForConditionalGeneration {
    +get_audio_feature()
    +get_image_feature()
    +get_video_feature()
    +get_input_embeddings()
    +post_process()
  }

  Qwen3OmniMoeForConditionalGeneration --> Qwen3VLForConditionalGeneration : delegates methods

File-Level Changes

Change | Details | Files
Refactor multimodal embedding utility to externalize preprocessing and integrate with forward_batch
  • Remove placeholder_tokens/use_deepstack parameters from embed_mm_inputs
  • Compute index masks for embeddings and integrate multimodal_model.post_process
  • Return updated forward_batch with input_embeds
  • Add resolve_language_model and should_use_external_mm_preprocess helpers
  • Introduce multimodal_preprocess_routine to run before model forward
python/sglang/srt/managers/mm_utils.py
Inject external multimodal preprocessing in ModelRunner
  • Call multimodal_preprocess_routine before forward_extend for multimodal models
  • Clear input_deepstack_embeds after raw forward
python/sglang/srt/model_executor/model_runner.py
Extend piecewise CUDA graph runner to handle input_embeds and deepstack embeddings
  • Add input_embeds and optional input_deepstack_embeds tensors in init
  • Toggle between input_ids and input_embeds during warmup_and_capture and capture_one_batch_size
  • Copy mrope_positions and deepstack buffers in replay_prepare
python/sglang/srt/model_executor/piecewise_cuda_graph_runner.py
Mirror input_embeds support in full CUDA graph runner
  • Add input_embeds buffer initialization alongside input_ids
python/sglang/srt/model_executor/cuda_graph_runner.py
Extend ForwardBatch to carry deepstack embeddings
  • Add input_deepstack_embeds attribute
python/sglang/srt/model_executor/forward_batch_info.py
Update Qwen2/VL family model wrappers to accept and forward input_embeds and deepstack buffers
  • Add input_embeds parameter and override in forward methods
  • Assign forward_batch.input_embeds and input_deepstack_embeds to model calls
  • Implement post_process in qwen3_vl and placeholder in qwen2_5_vl/qwen3_omni_moe
python/sglang/srt/models/qwen3_vl.py
python/sglang/srt/models/qwen2_5_vl.py
python/sglang/srt/models/qwen2_vl.py
python/sglang/srt/models/qwen3_omni_moe.py
Fix rotary embedding to use dynamic sequence lengths
  • Replace fixed num_tokens reshape with runtime seq_len from tensor shape
python/sglang/srt/layers/rotary_embedding.py
Add and update tests for piecewise CUDA graph on Qwen2.5-VL
  • Extend existing test_piecewise_cuda_graph timeout and add TestPiecewiseCudaGraphQwen25VL
  • Update run_suite to include nightly test
  • Add nightly test_vlms_piecewise_cuda_graph.py covering MMMU benchmark
test/srt/test_piecewise_cuda_graph.py
test/srt/run_suite.py
test/srt/nightly/test_vlms_piecewise_cuda_graph.py

@yuan-luo yuan-luo marked this pull request as draft November 19, 2025 09:15
@gemini-code-assist

Summary of Changes

Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly advances the integration of Piecewise CUDA Graph functionality for various Qwen-VL models, including Qwen2.5-VL, Qwen2-VL, and Qwen3-VL. The core of these changes involves a strategic refactoring of multimodal input processing, moving the embedding generation to an external step to optimize the capture of the language model's forward pass by CUDA graphs. This approach is designed to boost performance by reducing overhead for repetitive computations. Additionally, the update includes specialized support for Qwen3-VL's 'deepstack' embeddings and enhances the CUDA graph infrastructure to accommodate direct input embeddings. The functionality is validated through new comprehensive test suites.

Highlights

  • Piecewise CUDA Graph Support for Qwen-VL Models: This pull request introduces support for Piecewise CUDA Graph functionality across Qwen2.5-VL, Qwen2-VL, and Qwen3-VL models. This aims to enhance performance by capturing stable computational patterns within the model's execution flow.
  • Refactored Multimodal Input Preprocessing: The logic for embedding multimodal inputs (like images and videos) has been refactored to occur as an external preprocessing step before the main language model forward pass. This change is crucial for allowing the language model's operations to be more effectively captured by CUDA graphs.
  • Qwen3-VL Deepstack Embedding Handling: Specific handling for Qwen3-VL's unique 'deepstack' embeddings has been integrated into the new external multimodal preprocessing routine and the ForwardBatch object, ensuring compatibility and correct processing within the CUDA graph framework.
  • Flexible CUDA Graph Inputs: The CUDA graph runners have been updated to accept pre-computed input embeddings directly, rather than solely relying on token IDs. This flexibility is essential for multimodal models where embeddings are generated from non-textual inputs.
  • New Test Suites for Validation: New nightly test suites have been added to benchmark VLM models (specifically Qwen2.5-VL) with piecewise CUDA graph enabled, including accuracy verification on the MMMU benchmark and GSM8K.

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • There’s a lot of duplicated multimodal post-processing and forward logic across qwen2_vl, qwen2_5_vl, and qwen3_vl—consider refactoring this into a shared base or helper to reduce code duplication and improve consistency.
  • The embed_mm_inputs/general_mm_embed_routine signatures have grown with many new parameters; consider encapsulating these into a config object or dataclass to improve readability and avoid mismatched arguments.
  • The new nightly integration test for piecewise CUDA graph is quite large and may significantly increase CI runtime—consider splitting it into smaller focused tests or marking it as a slow suite to avoid blocking regular PR feedback.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- There’s a lot of duplicated multimodal post-processing and forward logic across qwen2_vl, qwen2_5_vl, and qwen3_vl—consider refactoring this into a shared base or helper to reduce code duplication and improve consistency.
- The embed_mm_inputs/general_mm_embed_routine signatures have grown with many new parameters; consider encapsulating these into a config object or dataclass to improve readability and avoid mismatched arguments.
- The new nightly integration test for piecewise CUDA graph is quite large and may significantly increase CI runtime—consider splitting it into smaller focused tests or marking it as a slow suite to avoid blocking regular PR feedback.

## Individual Comments

### Comment 1
<location> `python/sglang/srt/managers/mm_utils.py:681-682` </location>
<code_context>
     else:
         inputs_embeds = None

+    if skip_llm_forward:
+        return inputs_embeds
+
     hidden_states = language_model(
</code_context>

<issue_to_address>
**issue (bug_risk):** Returning only 'inputs_embeds' when 'skip_llm_forward' is True may break downstream code expecting a tuple.

To prevent runtime errors, ensure the return type remains consistent, or update documentation to reflect the change.
</issue_to_address>
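As a generic illustration of this footgun (toy code, not the SGLang routine), a function that returns a bare value on one path and a tuple on another breaks any caller that unpacks the result:

```python
def routine(skip_llm_forward: bool):
    inputs_embeds = object()   # stands in for the computed embeddings
    hidden_states = object()   # stands in for the language-model output
    if skip_llm_forward:
        return inputs_embeds                 # bare value on the early-exit path
    return hidden_states, inputs_embeds      # tuple on the normal path


hidden, embeds = routine(False)   # unpacks fine
# hidden, embeds = routine(True)  # TypeError: cannot unpack non-iterable object
```

Keeping both paths returning the same structure (or documenting the early-exit return explicitly) avoids this class of bug.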

### Comment 2
<location> `python/sglang/srt/model_executor/piecewise_cuda_graph_runner.py:241-243` </location>
<code_context>
                    torch.randint(0, 100, (num_tokens,), device=self.device)
                    if not self.use_input_embeds
                    else None

</code_context>

<issue_to_address>
**suggestion (code-quality):** Swap if/else branches of if expression to remove negation ([`swap-if-expression`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/swap-if-expression))

```suggestion
                    None if self.use_input_embeds else torch.randint(0, 100, (num_tokens,), device=self.device)

```

<br/><details><summary>Explanation</summary>Negated conditions are more difficult to read than positive ones, so it is best
to avoid them where we can. By swapping the `if` and `else` conditions around we
can invert the condition and make it positive.
</details>
</issue_to_address>

### Comment 3
<location> `test/srt/nightly/test_vlms_piecewise_cuda_graph.py:162-164` </location>
<code_context>
            result_files = glob.glob(f"{output_path}/**/*.json", recursive=True)
            if not result_files:
                result_files = glob.glob(f"{output_path}/*.json")

</code_context>

<issue_to_address>
**suggestion (code-quality):** Use `or` for providing a fallback value ([`use-or-for-fallback`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/use-or-for-fallback))

```suggestion
            result_files = glob.glob(f"{output_path}/**/*.json", recursive=True) or glob.glob(f"{output_path}/*.json")

```

<br/><details><summary>Explanation</summary>Thanks to the flexibility of Python's `or` operator, you can use a single
assignment statement, even if a variable can retrieve its value from various
sources. This is shorter and easier to read than using multiple assignments with
`if not` conditions.
</details>
</issue_to_address>

### Comment 4
<location> `test/srt/nightly/test_vlms_piecewise_cuda_graph.py:242-243` </location>
<code_context>

</code_context>

<issue_to_address>
**issue (code-quality):** Avoid conditionals in tests. ([`no-conditionals-in-tests`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/no-conditionals-in-tests))

<details><summary>Explanation</summary>Avoid complex code, like conditionals, in test functions.

Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
* loops
* conditionals

Some ways to fix this:

* Use parametrized tests to get rid of the loop.
* Move the complex logic into helpers.
* Move the complex part into pytest fixtures.

> Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / [Don't Put Logic in Tests](https://abseil.io/resources/swe-book/html/ch12.html#donapostrophet_put_logic_in_tests)
</details>
</issue_to_address>

### Comment 5
<location> `test/srt/nightly/test_vlms_piecewise_cuda_graph.py:245-246` </location>
<code_context>

</code_context>

<issue_to_address>
**issue (code-quality):** Avoid loops in tests. ([`no-loop-in-tests`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/no-loop-in-tests))

<details><summary>Explanation</summary>Avoid complex code, like loops, in test functions.

Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
* loops
* conditionals

Some ways to fix this:

* Use parametrized tests to get rid of the loop.
* Move the complex logic into helpers.
* Move the complex part into pytest fixtures.

> Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / [Don't Put Logic in Tests](https://abseil.io/resources/swe-book/html/ch12.html#donapostrophet_put_logic_in_tests)
</details>
</issue_to_address>

### Comment 6
<location> `python/sglang/srt/model_executor/piecewise_cuda_graph_runner.py:554` </location>
<code_context>
    def replay_prepare(
        self,
        forward_batch: ForwardBatch,
        **kwargs,
    ):
        if self.use_input_embeds:
            num_tokens = forward_batch.input_embeds.shape[0]
        else:
            num_tokens = len(forward_batch.input_ids)

        if self.use_deepstack:
            self.input_deepstack_embeds.zero_()  # may be removed.

        index = bisect.bisect_left(self.capture_num_tokens, num_tokens)
        static_num_tokens = self.capture_num_tokens[index]
        self.raw_num_tokens = num_tokens
        if static_num_tokens != num_tokens:
            self.out_cache_loc.zero_()
            self.out_cache_loc_swa.zero_()
        bs = forward_batch.batch_size

        if self.use_input_embeds:
            self.input_embeds[:num_tokens].copy_(forward_batch.input_embeds)
        else:
            self.input_ids[:num_tokens].copy_(forward_batch.input_ids)

        self.positions[:num_tokens].copy_(forward_batch.positions)
        self.out_cache_loc[:num_tokens].copy_(forward_batch.out_cache_loc)
        if forward_batch.out_cache_loc_swa is not None:
            self.out_cache_loc_swa[:num_tokens].copy_(forward_batch.out_cache_loc_swa)
        input_ids = self.input_ids[:static_num_tokens]
        positions = self.positions[:static_num_tokens]
        out_cache_loc = self.out_cache_loc[:static_num_tokens]

        out_cache_loc_swa = (
            self.out_cache_loc_swa[:static_num_tokens]
            if forward_batch.out_cache_loc_swa is not None
            else None
        )

        if forward_batch.mrope_positions is not None:
            self.mrope_positions[:, :num_tokens].copy_(forward_batch.mrope_positions)

        if self.use_input_embeds:
            input_ids = None
            input_embeds = self.input_embeds[:static_num_tokens]
        else:
            input_ids = self.input_ids[:static_num_tokens]
            input_embeds = None

        positions = self.positions[:static_num_tokens]
        out_cache_loc = self.out_cache_loc[:static_num_tokens]

        mrope_positions = (
            self.mrope_positions[:, :static_num_tokens]
            if forward_batch.mrope_positions is not None
            else None
        )

        next_token_logits_buffer = None

        input_deepstack_embeds = None
        if self.use_deepstack:
            self.input_deepstack_embeds[:num_tokens].copy_(
                forward_batch.input_deepstack_embeds
            )
            input_deepstack_embeds = self.input_deepstack_embeds[:static_num_tokens]

        static_forward_batch = ForwardBatch(
            forward_mode=forward_batch.forward_mode,
            batch_size=bs,
            input_ids=input_ids,
            input_embeds=input_embeds,
            req_pool_indices=forward_batch.req_pool_indices,
            seq_lens=forward_batch.seq_lens,
            next_token_logits_buffer=next_token_logits_buffer,
            orig_seq_lens=forward_batch.orig_seq_lens,
            seq_lens_cpu=forward_batch.seq_lens_cpu,
            req_to_token_pool=self.model_runner.req_to_token_pool,
            token_to_kv_pool=self.model_runner.token_to_kv_pool,
            attn_backend=self.model_runner.attn_backend,
            out_cache_loc=out_cache_loc,
            out_cache_loc_swa=out_cache_loc_swa,
            seq_lens_sum=forward_batch.seq_lens_sum,
            encoder_lens=forward_batch.encoder_lens,
            return_logprob=False,
            extend_seq_lens=forward_batch.extend_seq_lens,
            extend_prefix_lens=forward_batch.extend_prefix_lens,
            extend_start_loc=forward_batch.extend_start_loc,
            extend_prefix_lens_cpu=forward_batch.extend_prefix_lens_cpu,
            extend_seq_lens_cpu=forward_batch.extend_seq_lens_cpu,
            extend_logprob_start_lens_cpu=forward_batch.extend_logprob_start_lens_cpu,
            extend_num_tokens=forward_batch.extend_num_tokens,
            extend_input_logprob_token_ids_gpu=forward_batch.extend_input_logprob_token_ids_gpu,
            positions=positions,
            global_num_tokens_gpu=forward_batch.global_num_tokens_gpu,
            global_num_tokens_for_logprob_gpu=forward_batch.global_num_tokens_for_logprob_gpu,
            dp_padding_mode=forward_batch.dp_padding_mode,
            global_dp_buffer_len=forward_batch.global_dp_buffer_len,
            mrope_positions=mrope_positions,
            spec_algorithm=forward_batch.spec_algorithm,
            spec_info=forward_batch.spec_info,
            capture_hidden_mode=forward_batch.capture_hidden_mode,
            num_token_non_padded=forward_batch.num_token_non_padded,
            global_forward_mode=forward_batch.global_forward_mode,
            lora_ids=forward_batch.lora_ids,
            sampling_info=forward_batch.sampling_info,
            mm_inputs=forward_batch.mm_inputs,
            temp_scaled_logprobs=forward_batch.temp_scaled_logprobs,
            temperature=forward_batch.temperature,
            top_p_normalized_logprobs=forward_batch.top_p_normalized_logprobs,
            top_p=forward_batch.top_p,
            input_deepstack_embeds=input_deepstack_embeds,
        )

        return static_forward_batch

</code_context>

<issue_to_address>
**issue (code-quality):** Inline variable that is immediately returned ([`inline-immediately-returned-variable`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/inline-immediately-returned-variable/))
</issue_to_address>

### Comment 7
<location> `python/sglang/srt/models/qwen2_5_vl.py:578-586` </location>
<code_context>
    def post_process(
        self,
        inputs_embeds,
        modalities: List[Modality],
        embeddings: List[torch.Tensor],
        indices: List[torch.Tensor],
        forward_batch: ForwardBatch,
    ) -> torch.Tensor:
        # Placeholder for post_process
        new_embeddings = []
        for i, (modality, embedding, index) in enumerate(
            zip(modalities, embeddings, indices)
        ):
            if embedding is None or index is None:
                continue

            new_embeddings.append(embedding)
        return new_embeddings, forward_batch

</code_context>

<issue_to_address>
**suggestion (code-quality):** We've found these issues:

- Convert for loop into list comprehension ([`list-comprehension`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/list-comprehension/))
- Remove unnecessary calls to `enumerate` when the index is not used ([`remove-unused-enumerate`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/remove-unused-enumerate/))
- Lift code into else after jump in control flow ([`reintroduce-else`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/reintroduce-else/))
- Remove redundant continue statement ([`remove-redundant-continue`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/remove-redundant-continue/))

```suggestion
        new_embeddings = [
            embedding
            for modality, embedding, index in zip(modalities, embeddings, indices)
            if embedding is not None and index is not None
        ]
```
</issue_to_address>

### Comment 8
<location> `test/srt/nightly/test_vlms_piecewise_cuda_graph.py:85` </location>
<code_context>
    def run_mmmu_eval(
        self,
        model_version: str,
        output_path: str,
        *,
        env: dict | None = None,
    ):
        """
        Evaluate a VLM on the MMMU validation set with lmms‑eval.
        Only `model_version` (checkpoint) and `chat_template` vary;
        We are focusing only on the validation set due to resource constraints.
        """
        # -------- fixed settings --------
        model = "openai_compatible"
        tp = 1
        tasks = "mmmu_val"
        batch_size = 32
        log_suffix = "openai_compatible"
        os.makedirs(output_path, exist_ok=True)

        # -------- compose --model_args --------
        model_args = f'model_version="{model_version}",' f"tp={tp}"

        # -------- build command list --------
        cmd = [
            "python3",
            "-m",
            "lmms_eval",
            "--model",
            model,
            "--model_args",
            model_args,
            "--tasks",
            tasks,
            "--batch_size",
            str(batch_size),
            "--output_path",
            str(output_path),
        ]

        subprocess.run(
            cmd,
            check=True,
            timeout=3600,
        )

</code_context>

<issue_to_address>
**suggestion (code-quality):** Remove unnecessary casts to int, str, float or bool ([`remove-unnecessary-cast`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/remove-unnecessary-cast/))

```suggestion
            output_path,
```
</issue_to_address>

### Comment 9
<location> `test/srt/nightly/test_vlms_piecewise_cuda_graph.py:123` </location>
<code_context>
    def _run_vlm_mmmu_test(
        self,
        model,
        output_path,
        test_name="",
        custom_env=None,
        log_level="info",
        capture_output=False,
    ):
        """
        Common method to run VLM MMMU benchmark test.
        Args:
            model: Model to test
            output_path: Path for output logs
            test_name: Optional test name for logging
            custom_env: Optional custom environment variables
            log_level: Log level for server (default: "info")
            capture_output: Whether to capture server stdout/stderr
        """
        print(f"\nTesting model: {model.model}{test_name}")

        process = None
        mmmu_accuracy = 0  # Initialize to handle potential exceptions
        server_output = ""

        try:
            # Prepare environment variables
            process_env = os.environ.copy()
            if custom_env:
                process_env.update(custom_env)
            # if test vlm with cuda_ipc feature, open this env_var
            process_env["SGLANG_USE_CUDA_IPC_TRANSPORT"] = "1"

            # Prepare stdout/stderr redirection if needed
            stdout_file = None
            stderr_file = None
            if capture_output:
                stdout_file = open("/tmp/server_stdout.log", "w")
                stderr_file = open("/tmp/server_stderr.log", "w")

            # Launch server for testing
            process = popen_launch_server(
                model.model,
                base_url=self.base_url,
                timeout=self.time_out,
                api_key=self.api_key,
                other_args=[
                    "--trust-remote-code",
                    "--piecewise-cuda-graph-max-tokens",
                    "8192",
                    "--enable-piecewise-cuda-graph",
                    "--tp=8",
                    "--piecewise-cuda-graph-compiler=eager",
                    "--disable-radix-cache",
                    "--log-level",
                    log_level,
                ],
                env=process_env,
                return_stdout_stderr=(
                    (stdout_file, stderr_file) if capture_output else None
                ),
            )

            # Run evaluation
            self.run_mmmu_eval(model.model, output_path)

            # Get the result file
            # Search recursively for JSON result files (lmms-eval v0.4.1+ creates subdirectories)
            result_files = glob.glob(f"{output_path}/**/*.json", recursive=True)
            if not result_files:
                result_files = glob.glob(f"{output_path}/*.json")

            if not result_files:
                raise FileNotFoundError(f"No JSON result files found in {output_path}")

            result_file_path = result_files[0]

            with open(result_file_path, "r") as f:
                result = json.load(f)
                print(f"Result{test_name}\n: {result}")

            # Process the result
            mmmu_accuracy = result["results"]["mmmu_val"]["mmmu_acc,none"]
            print(
                f"Model {model.model} achieved accuracy{test_name}: {mmmu_accuracy:.4f}"
            )

            # Capture server output if requested
            if capture_output and process:
                server_output = self._read_output_from_files()

            # Assert performance meets expected threshold
            self.assertGreaterEqual(
                mmmu_accuracy,
                model.mmmu_accuracy,
                f"Model {model.model} accuracy ({mmmu_accuracy:.4f}) below expected threshold ({model.mmmu_accuracy:.4f}){test_name}",
            )

            return server_output

        except Exception as e:
            print(f"Error testing {model.model}{test_name}: {e}")
            self.fail(f"Test failed for {model.model}{test_name}: {e}")

        finally:
            # Ensure process cleanup happens regardless of success/failure
            if process is not None and process.poll() is None:
                print(f"Cleaning up process {process.pid}")
                try:
                    kill_process_tree(process.pid)
                except Exception as e:
                    print(f"Error killing process: {e}")

            # clean up temporary files
            if capture_output:
                if stdout_file:
                    stdout_file.close()
                if stderr_file:
                    stderr_file.close()
                for filename in ["/tmp/server_stdout.log", "/tmp/server_stderr.log"]:
                    try:
                        if os.path.exists(filename):
                            os.remove(filename)
                    except Exception as e:
                        print(f"Error removing {filename}: {e}")

</code_context>

<issue_to_address>
**suggestion (code-quality):** We've found these issues:

- Merge dictionary updates via the union operator ([`dict-assign-update-to-union`](https://docs.sourcery.ai/Reference/Default-Rules/suggestions/dict-assign-update-to-union/))
- Low code quality found in TestVLMPiecewiseCudaGraph._run_vlm_mmmu_test - 21% ([`low-code-quality`](https://docs.sourcery.ai/Reference/Default-Rules/comments/low-code-quality/))

```suggestion
                process_env |= custom_env
```

<br/><details><summary>Explanation</summary>
The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

- Reduce the function length by extracting pieces of functionality out into
  their own functions. This is the most important thing you can do - ideally a
  function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
  sits together within the function rather than being scattered.</details>
</issue_to_address>

### Comment 10
<location> `test/srt/nightly/test_vlms_piecewise_cuda_graph.py:231-232` </location>
<code_context>
    def _read_output_from_files(self):
        output_lines = []

        log_files = [
            ("/tmp/server_stdout.log", "[STDOUT]"),
            ("/tmp/server_stderr.log", "[STDERR]"),
        ]
        for filename, tag in log_files:
            try:
                if os.path.exists(filename):
                    with open(filename, "r") as f:
                        for line in f:
                            output_lines.append(f"{tag} {line.rstrip()}")
            except Exception as e:
                print(f"Error reading {tag.lower()} file: {e}")

        return "\n".join(output_lines)

</code_context>

<issue_to_address>
**suggestion (code-quality):** Replace a for append loop with list extend ([`for-append-to-extend`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/for-append-to-extend/))

```suggestion
                        output_lines.extend(f"{tag} {line.rstrip()}" for line in f)
```
</issue_to_address>


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for piecewise CUDA graph execution for Qwen2.5-VL and other multimodal models, which is a significant feature. The approach of externalizing the multimodal preprocessing and using a model-specific post_process method is a solid design. My review includes a few suggestions to enhance the design's extensibility, correct a type hint, and clean up some minor code issues.

Comment on lines +171 to +172
        # The following is just for qwen3vl, maybe not ideal to place it here.
        self.use_deepstack = getattr(model_runner.model, "use_deepstack", False)

Severity: high

The comment correctly points out that this Qwen3-VL-specific logic is not ideal here. A generic runner like PiecewiseCudaGraphRunner should not contain logic tied to a specific model. This makes the code harder to maintain and extend.

Consider moving this logic or making it more generic. For example, the runner could check for the presence of use_deepstack and deepstack_visual_indexes attributes on the model object and derive the necessary configuration from there, rather than being hardcoded for a specific model family.
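One hedged sketch of that feature detection (illustrative only; the helper name is hypothetical, and `use_deepstack` / `deepstack_visual_indexes` are the attributes named in this review, not a confirmed SGLang interface):

```python
def _resolve_deepstack_config(model) -> tuple[bool, int]:
    # Any model exposing these attributes opts in; no model-family checks needed.
    use_deepstack = bool(getattr(model, "use_deepstack", False))
    visual_indexes = getattr(model, "deepstack_visual_indexes", None)
    num_deepstack_layers = (
        len(visual_indexes) if use_deepstack and visual_indexes is not None else 0
    )
    return use_deepstack, num_deepstack_layers
```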

Comment on lines +820 to +853
def should_use_external_mm_preprocess(multimodal_model: nn.Module) -> bool:
    """Decide whether we should use our generic "multimodal_preprocess_routine".
    We are adapting VLM for piecewise CUDA graph. Since the encoder's forward
    pass cannot be executed within the model's forward pass, we need to
    precompute image embeddings using the encoder within the model runner.
    For models that have already been adjusted, there is a member called
    should_use_external_mm_preprocess, which is set to True. In practice,
    the multimodal_preprocess_routine function will be called in the
    model_runner.forward_extend to handle multimodal inputs.
    For models that have not yet been adapted, the general_mm_embed_routine
    will still be called in the model class's forward function for processing.
    Current strategy:
    - Llava family (models with vision_tower + multi_modal_projector):
      Their forward already calls general_mm_embed_routine and includes
      built-in multimodal processing. If we run it again in ModelRunner,
      it will conflict with the internal logic, so we skip it here.
    - Others (such as Qwen2-VL / Qwen2.5-VL / Qwen3-VL): use the
      multimodal preprocessing.
    """

    cls_name = multimodal_model.__class__.__name__

    qwen_vl_classes = {
        "Qwen2VLForConditionalGeneration",
        "Qwen2_5_VLForConditionalGeneration",
        "Qwen3VLForConditionalGeneration",
        "Qwen3VLMoeForConditionalGeneration",
        "Qwen3OmniMoeForConditionalGeneration",
    }

    return cls_name in qwen_vl_classes

Severity: medium

Using a hardcoded set of class names in should_use_external_mm_preprocess makes it difficult to extend this functionality to new models in the future. A more robust and maintainable approach would be to use feature detection on the model object itself. For instance, you could check for the existence of a specific attribute (e.g., model.supports_external_mm_preprocess = True) or a method (e.g., hasattr(model, "post_process")). This would make the logic more generic and decoupled from specific model implementations.
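For illustration, a minimal version of that feature-detection approach might look like the following (the capability flag and the reliance on a `post_process` hook follow the suggestion above and are hypothetical, not existing SGLang API):

```python
def should_use_external_mm_preprocess(multimodal_model) -> bool:
    # Opt in via an explicit capability flag, or via the presence of the
    # post_process hook that the new external preprocessing path relies on.
    return bool(
        getattr(multimodal_model, "supports_external_mm_preprocess", False)
        or hasattr(multimodal_model, "post_process")
    )
```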

    data_embedding_funcs: Dict[
        Modality, Callable[[List[MultimodalDataItem]], torch.Tensor]
    ] = None,
) -> torch.Tensor:

Severity: medium

The return type hint for this function is torch.Tensor, but it actually returns a ForwardBatch object on lines 874 and 939. Please update the type hint to -> ForwardBatch: for correctness and clarity.

Suggested change
) -> torch.Tensor:
) -> ForwardBatch:

        num_tokens = len(forward_batch.input_ids)

        if self.use_deepstack:
            self.input_deepstack_embeds.zero_()  # may be removed.

Severity: medium

The comment may be removed is unclear. If zeroing out self.input_deepstack_embeds is necessary to prevent stale data in the CUDA graph, the comment should state that for clarity. If it's truly optional, please remove the line and the comment to avoid confusion.

Suggested change
self.input_deepstack_embeds.zero_() # may be removed.
self.input_deepstack_embeds.zero_() # Clear stale data for CUDA graph correctness.

    ) -> torch.Tensor:
        if not self.use_deepstack:
            return embeddings, forward_batch
        deepstack_embeddings = []

Severity: medium

The variable deepstack_embeddings is initialized but never used. It can be removed to improve code clarity.
