
Conversation

@yuan-luo (Collaborator) commented Nov 19, 2025

Motivation

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Summary by Sourcery

Add support for piecewise CUDA graph execution for multimodal Qwen2.5-VL (and related QwenVL models) by externalizing multimodal preprocessing and extending graph runners to handle embeddings directly.
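The sketch below is a standalone toy (not SGLang code) showing the general shape of that externalization: vision embeddings are merged into the text embeddings eagerly, and only the resulting `input_embeds` are handed to the language-model forward, which is the part the graph runners can capture. All class and function names here are illustrative stand-ins.

```python
# Toy illustration of "external" multimodal preprocessing (assumed names, not SGLang API).
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class ToyForwardBatch:
    input_ids: torch.Tensor
    input_embeds: Optional[torch.Tensor] = None


def toy_multimodal_preprocess(
    batch: ToyForwardBatch,
    embed_tokens: torch.nn.Embedding,
    image_embeds: Optional[torch.Tensor],
    image_mask: Optional[torch.Tensor],
) -> ToyForwardBatch:
    """Eager step: embed text tokens and scatter image embeddings into place."""
    embeds = embed_tokens(batch.input_ids)
    if image_embeds is not None and image_mask is not None:
        embeds = embeds.clone()
        embeds[image_mask] = image_embeds.to(embeds.dtype)
    batch.input_embeds = embeds
    return batch


if __name__ == "__main__":
    vocab, hidden = 32, 8
    embed_tokens = torch.nn.Embedding(vocab, hidden)
    lm = torch.nn.Linear(hidden, hidden)  # stand-in for the transformer stack
    batch = ToyForwardBatch(input_ids=torch.randint(0, vocab, (6,)))
    image_mask = torch.tensor([False, True, True, False, False, False])
    batch = toy_multimodal_preprocess(batch, embed_tokens, torch.randn(2, hidden), image_mask)
    # Graph-capturable step: consumes embeddings only, never raw pixels.
    hidden_states = lm(batch.input_embeds)
    print(hidden_states.shape)  # torch.Size([6, 8])
```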

New Features:

  • Introduce multimodal_preprocess_routine with resolve_language_model and should_use_external_mm_preprocess for external piecewise CUDA graph preprocessing.
  • Extend piecewise_cuda_graph_runner and cuda_graph_runner to use input_embeds and input_deepstack_embeds for multimodal inputs.
  • Update Qwen2.5-VL, Qwen2-VL, and Qwen3-VL model classes to accept input_embeds and deepstack embeddings and delegate to new post_process logic.

Enhancements:

  • Refactor embed_mm_inputs and general_mm_embed_routine to remove placeholder_tokens, consolidate deepstack handling into post_process, and return forward_batch.
  • Adjust rotary embedding layer to compute sequence length dynamically and remove static num_tokens dependency.
  • Modify model_runner forward_extend to invoke multimodal_preprocess_routine for multimodal models.

Tests:

  • Add TestPiecewiseCudaGraphQwen25VL and expand nightly integration test suite for VLM piecewise CUDA graph and MMMU benchmark evaluations.
  • Increase timeout and test file size in run_suite to include new tests.


sourcery-ai bot commented Nov 19, 2025

Reviewer's Guide

This PR refactors the multimodal embedding pipeline and extends both piecewise and full CUDA graph runners to support Qwen2.5-VL by introducing external preprocessing of multimodal inputs, passing precomputed input_embeds (and deepstack embeddings) through graph capture and replay, updating model wrappers to consume these tensors, and adding targeted tests for the new flow.
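To make "passing precomputed input_embeds through graph capture and replay" concrete, here is a minimal, generic capture/replay pattern using plain `torch.cuda.CUDAGraph` (illustrative only, not the SGLang runner; it assumes a CUDA device): the graph is captured reading a static `input_embeds` buffer instead of token IDs, and replay only copies freshly computed embeddings into that buffer.

```python
import torch

device = "cuda"
hidden, max_tokens = 64, 128
lm = torch.nn.Linear(hidden, hidden, device=device)  # stand-in for the LM stack

static_input_embeds = torch.zeros(max_tokens, hidden, device=device)
static_out = torch.empty(max_tokens, hidden, device=device)

# Warm up on a side stream before capture, per the standard CUDA graph recipe.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    lm(static_input_embeds)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.no_grad(), torch.cuda.graph(graph):
    static_out.copy_(lm(static_input_embeds))


def replay(new_embeds: torch.Tensor) -> torch.Tensor:
    """Replay-prepare: fill the static buffer (zero-padding the tail), then replay."""
    n = new_embeds.shape[0]
    static_input_embeds[:n].copy_(new_embeds)
    static_input_embeds[n:].zero_()
    graph.replay()
    return static_out[:n]


out = replay(torch.randn(10, hidden, device=device))
print(out.shape)  # torch.Size([10, 64])
```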

Sequence diagram for external multimodal preprocessing and input_embeds flow

sequenceDiagram
  participant ModelRunner
  participant MMPreprocess as multimodal_preprocess_routine()
  participant EmbedMM as embed_mm_inputs()
  participant Model
  participant CUDA_Graph_Runner

  ModelRunner->>MMPreprocess: If is_multimodal, preprocess batch
  MMPreprocess->>EmbedMM: Compute input_embeds, deepstack_embeds
  EmbedMM-->>MMPreprocess: Return input_embeds, update forward_batch
  MMPreprocess-->>ModelRunner: Return updated forward_batch
  ModelRunner->>CUDA_Graph_Runner: Pass forward_batch with input_embeds
  CUDA_Graph_Runner->>Model: Call forward(input_embeds, ...)
  Model-->>CUDA_Graph_Runner: Return hidden_states

Class diagram for updated ForwardBatch and model wrappers

classDiagram
  class ForwardBatch {
    +torch.Tensor input_ids
    +torch.Tensor input_embeds
    +torch.Tensor mrope_positions
    +Optional[torch.Tensor] input_deepstack_embeds
    ...
  }

  class Qwen2VLForConditionalGeneration {
    +forward(input_ids, positions, forward_batch, input_embeds=None, get_embedding=False)
  }

  class Qwen2_5_VLForConditionalGeneration {
    +forward(input_ids, positions, forward_batch, input_embeds=None, get_embedding=False, pp_proxy_tensors=None)
    +post_process(inputs_embeds, modalities, embeddings, indices, forward_batch)
  }

  class Qwen3VLForConditionalGeneration {
    +forward(input_ids, positions, forward_batch, input_embeds=None, get_embedding=False)
    +post_process(inputs_embeds, modalities, embeddings, indices, forward_batch)
  }

  class Qwen3OmniMoeForConditionalGeneration {
    +get_audio_feature()
    +get_image_feature()
    +get_video_feature()
    +get_input_embeddings()
    +post_process()
  }

  Qwen3OmniMoeForConditionalGeneration --> Qwen3VLForConditionalGeneration : delegates methods

File-Level Changes

Change | Details | Files
Refactor multimodal embedding utility to externalize preprocessing and integrate with forward_batch
  • Remove placeholder_tokens/use_deepstack parameters from embed_mm_inputs
  • Compute index masks for embeddings and integrate multimodal_model.post_process
  • Return updated forward_batch with input_embeds
  • Add resolve_language_model and should_use_external_mm_preprocess helpers
  • Introduce multimodal_preprocess_routine to run before model forward
python/sglang/srt/managers/mm_utils.py
Inject external multimodal preprocessing in ModelRunner
  • Call multimodal_preprocess_routine before forward_extend for multimodal models
  • Clear input_deepstack_embeds after raw forward
python/sglang/srt/model_executor/model_runner.py
Extend piecewise CUDA graph runner to handle input_embeds and deepstack embeddings
  • Add input_embeds and optional input_deepstack_embeds tensors in init
  • Toggle between input_ids and input_embeds during warmup_and_capture and capture_one_batch_size
  • Copy mrope_positions and deepstack buffers in replay_prepare
python/sglang/srt/model_executor/piecewise_cuda_graph_runner.py
Mirror input_embeds support in full CUDA graph runner
  • Add input_embeds buffer initialization alongside input_ids
python/sglang/srt/model_executor/cuda_graph_runner.py
Extend ForwardBatch to carry deepstack embeddings
  • Add input_deepstack_embeds attribute
python/sglang/srt/model_executor/forward_batch_info.py
Update Qwen2/VL family model wrappers to accept and forward input_embeds and deepstack buffers
  • Add input_embeds parameter and override in forward methods
  • Assign forward_batch.input_embeds and input_deepstack_embeds to model calls
  • Implement post_process in qwen3_vl and placeholder in qwen2_5_vl/qwen3_omni_moe
python/sglang/srt/models/qwen3_vl.py
python/sglang/srt/models/qwen2_5_vl.py
python/sglang/srt/models/qwen2_vl.py
python/sglang/srt/models/qwen3_omni_moe.py
Fix rotary embedding to use dynamic sequence lengths
  • Replace fixed num_tokens reshape with runtime seq_len from tensor shape
python/sglang/srt/layers/rotary_embedding.py
Add and update tests for piecewise CUDA graph on Qwen2.5-VL
  • Extend existing test_piecewise_cuda_graph timeout and add TestPiecewiseCudaGraphQwen25VL
  • Update run_suite to include nightly test
  • Add nightly test_vlms_piecewise_cuda_graph.py covering MMMU benchmark
test/srt/test_piecewise_cuda_graph.py
test/srt/run_suite.py
test/srt/nightly/test_vlms_piecewise_cuda_graph.py

@yuan-luo yuan-luo marked this pull request as draft November 19, 2025 09:15
@gemini-code-assist

Summary of Changes

Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly advances the integration of Piecewise CUDA Graph functionality for various Qwen-VL models, including Qwen2.5-VL, Qwen2-VL, and Qwen3-VL. The core of these changes involves a strategic refactoring of multimodal input processing, moving the embedding generation to an external step to optimize the capture of the language model's forward pass by CUDA graphs. This approach is designed to boost performance by reducing overhead for repetitive computations. Additionally, the update includes specialized support for Qwen3-VL's 'deepstack' embeddings and enhances the CUDA graph infrastructure to accommodate direct input embeddings. The functionality is validated through new comprehensive test suites.

Highlights

  • Piecewise CUDA Graph Support for Qwen-VL Models: This pull request introduces support for Piecewise CUDA Graph functionality across Qwen2.5-VL, Qwen2-VL, and Qwen3-VL models. This aims to enhance performance by capturing stable computational patterns within the model's execution flow.
  • Refactored Multimodal Input Preprocessing: The logic for embedding multimodal inputs (like images and videos) has been refactored to occur as an external preprocessing step before the main language model forward pass. This change is crucial for allowing the language model's operations to be more effectively captured by CUDA graphs.
  • Qwen3-VL Deepstack Embedding Handling: Specific handling for Qwen3-VL's unique 'deepstack' embeddings has been integrated into the new external multimodal preprocessing routine and the ForwardBatch object, ensuring compatibility and correct processing within the CUDA graph framework.
  • Flexible CUDA Graph Inputs: The CUDA graph runners have been updated to accept pre-computed input embeddings directly, rather than solely relying on token IDs. This flexibility is essential for multimodal models where embeddings are generated from non-textual inputs.
  • New Test Suites for Validation: New nightly test suites have been added to benchmark VLM models (specifically Qwen2.5-VL) with piecewise CUDA graph enabled, including accuracy verification on the MMMU benchmark and GSM8K.

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • There’s a lot of duplicated multimodal post-processing and forward logic across qwen2_vl, qwen2_5_vl, and qwen3_vl—consider refactoring this into a shared base or helper to reduce code duplication and improve consistency.
  • The embed_mm_inputs/general_mm_embed_routine signatures have grown with many new parameters; consider encapsulating these into a config object or dataclass to improve readability and avoid mismatched arguments.
  • The new nightly integration test for piecewise CUDA graph is quite large and may significantly increase CI runtime—consider splitting it into smaller focused tests or marking it as a slow suite to avoid blocking regular PR feedback.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- There’s a lot of duplicated multimodal post-processing and forward logic across qwen2_vl, qwen2_5_vl, and qwen3_vl—consider refactoring this into a shared base or helper to reduce code duplication and improve consistency.
- The embed_mm_inputs/general_mm_embed_routine signatures have grown with many new parameters; consider encapsulating these into a config object or dataclass to improve readability and avoid mismatched arguments.
- The new nightly integration test for piecewise CUDA graph is quite large and may significantly increase CI runtime—consider splitting it into smaller focused tests or marking it as a slow suite to avoid blocking regular PR feedback.

## Individual Comments

### Comment 1
<location> `python/sglang/srt/managers/mm_utils.py:681-682` </location>
<code_context>
     else:
         inputs_embeds = None

+    if skip_llm_forward:
+        return inputs_embeds
+
     hidden_states = language_model(
</code_context>

<issue_to_address>
**issue (bug_risk):** Returning only 'inputs_embeds' when 'skip_llm_forward' is True may break downstream code expecting a tuple.

To prevent runtime errors, ensure the return type remains consistent, or update documentation to reflect the change.
</issue_to_address>
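As a generic illustration of this footgun (toy code, not the SGLang routine), a function that returns a bare value on one path and a tuple on another breaks any caller that unpacks the result:

```python
def routine(skip_llm_forward: bool):
    inputs_embeds = object()   # stands in for the computed embeddings
    hidden_states = object()   # stands in for the language-model output
    if skip_llm_forward:
        return inputs_embeds                 # bare value on the early-exit path
    return hidden_states, inputs_embeds      # tuple on the normal path


hidden, embeds = routine(False)   # unpacks fine
# hidden, embeds = routine(True)  # TypeError: cannot unpack non-iterable object
```

Keeping both paths returning the same structure (or documenting the early-exit return explicitly) avoids this class of bug.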

### Comment 2
<location> `python/sglang/srt/model_executor/piecewise_cuda_graph_runner.py:241-243` </location>
<code_context>
                    torch.randint(0, 100, (num_tokens,), device=self.device)
                    if not self.use_input_embeds
                    else None

</code_context>

<issue_to_address>
**suggestion (code-quality):** Swap if/else branches of if expression to remove negation ([`swap-if-expression`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/swap-if-expression))

```suggestion
                    None if self.use_input_embeds else torch.randint(0, 100, (num_tokens,), device=self.device)

```

<br/><details><summary>Explanation</summary>Negated conditions are more difficult to read than positive ones, so it is best
to avoid them where we can. By swapping the `if` and `else` conditions around we
can invert the condition and make it positive.
</details>
</issue_to_address>

### Comment 3
<location> `test/srt/nightly/test_vlms_piecewise_cuda_graph.py:162-164` </location>
<code_context>
            result_files = glob.glob(f"{output_path}/**/*.json", recursive=True)
            if not result_files:
                result_files = glob.glob(f"{output_path}/*.json")

</code_context>

<issue_to_address>
**suggestion (code-quality):** Use `or` for providing a fallback value ([`use-or-for-fallback`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/use-or-for-fallback))

```suggestion
            result_files = glob.glob(f"{output_path}/**/*.json", recursive=True) or glob.glob(f"{output_path}/*.json")

```

<br/><details><summary>Explanation</summary>Thanks to the flexibility of Python's `or` operator, you can use a single
assignment statement, even if a variable can retrieve its value from various
sources. This is shorter and easier to read than using multiple assignments with
`if not` conditions.
</details>
</issue_to_address>

### Comment 4
<location> `test/srt/nightly/test_vlms_piecewise_cuda_graph.py:242-243` </location>
<code_context>

</code_context>

<issue_to_address>
**issue (code-quality):** Avoid conditionals in tests. ([`no-conditionals-in-tests`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/no-conditionals-in-tests))

<details><summary>Explanation</summary>Avoid complex code, like conditionals, in test functions.

Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
* loops
* conditionals

Some ways to fix this:

* Use parametrized tests to get rid of the loop.
* Move the complex logic into helpers.
* Move the complex part into pytest fixtures.

> Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / [Don't Put Logic in Tests](https://abseil.io/resources/swe-book/html/ch12.html#donapostrophet_put_logic_in_tests)
</details>
</issue_to_address>

### Comment 5
<location> `test/srt/nightly/test_vlms_piecewise_cuda_graph.py:245-246` </location>
<code_context>

</code_context>

<issue_to_address>
**issue (code-quality):** Avoid loops in tests. ([`no-loop-in-tests`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/no-loop-in-tests))

<details><summary>Explanation</summary>Avoid complex code, like loops, in test functions.

Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
* loops
* conditionals

Some ways to fix this:

* Use parametrized tests to get rid of the loop.
* Move the complex logic into helpers.
* Move the complex part into pytest fixtures.

> Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / [Don't Put Logic in Tests](https://abseil.io/resources/swe-book/html/ch12.html#donapostrophet_put_logic_in_tests)
</details>
</issue_to_address>

### Comment 6
<location> `python/sglang/srt/model_executor/piecewise_cuda_graph_runner.py:554` </location>
<code_context>
    def replay_prepare(
        self,
        forward_batch: ForwardBatch,
        **kwargs,
    ):
        if self.use_input_embeds:
            num_tokens = forward_batch.input_embeds.shape[0]
        else:
            num_tokens = len(forward_batch.input_ids)

        if self.use_deepstack:
            self.input_deepstack_embeds.zero_()  # may be removed.

        index = bisect.bisect_left(self.capture_num_tokens, num_tokens)
        static_num_tokens = self.capture_num_tokens[index]
        self.raw_num_tokens = num_tokens
        if static_num_tokens != num_tokens:
            self.out_cache_loc.zero_()
            self.out_cache_loc_swa.zero_()
        bs = forward_batch.batch_size

        if self.use_input_embeds:
            self.input_embeds[:num_tokens].copy_(forward_batch.input_embeds)
        else:
            self.input_ids[:num_tokens].copy_(forward_batch.input_ids)

        self.positions[:num_tokens].copy_(forward_batch.positions)
        self.out_cache_loc[:num_tokens].copy_(forward_batch.out_cache_loc)
        if forward_batch.out_cache_loc_swa is not None:
            self.out_cache_loc_swa[:num_tokens].copy_(forward_batch.out_cache_loc_swa)
        input_ids = self.input_ids[:static_num_tokens]
        positions = self.positions[:static_num_tokens]
        out_cache_loc = self.out_cache_loc[:static_num_tokens]

        out_cache_loc_swa = (
            self.out_cache_loc_swa[:static_num_tokens]
            if forward_batch.out_cache_loc_swa is not None
            else None
        )

        if forward_batch.mrope_positions is not None:
            self.mrope_positions[:, :num_tokens].copy_(forward_batch.mrope_positions)

        if self.use_input_embeds:
            input_ids = None
            input_embeds = self.input_embeds[:static_num_tokens]
        else:
            input_ids = self.input_ids[:static_num_tokens]
            input_embeds = None

        positions = self.positions[:static_num_tokens]
        out_cache_loc = self.out_cache_loc[:static_num_tokens]

        mrope_positions = (
            self.mrope_positions[:, :static_num_tokens]
            if forward_batch.mrope_positions is not None
            else None
        )

        next_token_logits_buffer = None

        input_deepstack_embeds = None
        if self.use_deepstack:
            self.input_deepstack_embeds[:num_tokens].copy_(
                forward_batch.input_deepstack_embeds
            )
            input_deepstack_embeds = self.input_deepstack_embeds[:static_num_tokens]

        static_forward_batch = ForwardBatch(
            forward_mode=forward_batch.forward_mode,
            batch_size=bs,
            input_ids=input_ids,
            input_embeds=input_embeds,
            req_pool_indices=forward_batch.req_pool_indices,
            seq_lens=forward_batch.seq_lens,
            next_token_logits_buffer=next_token_logits_buffer,
            orig_seq_lens=forward_batch.orig_seq_lens,
            seq_lens_cpu=forward_batch.seq_lens_cpu,
            req_to_token_pool=self.model_runner.req_to_token_pool,
            token_to_kv_pool=self.model_runner.token_to_kv_pool,
            attn_backend=self.model_runner.attn_backend,
            out_cache_loc=out_cache_loc,
            out_cache_loc_swa=out_cache_loc_swa,
            seq_lens_sum=forward_batch.seq_lens_sum,
            encoder_lens=forward_batch.encoder_lens,
            return_logprob=False,
            extend_seq_lens=forward_batch.extend_seq_lens,
            extend_prefix_lens=forward_batch.extend_prefix_lens,
            extend_start_loc=forward_batch.extend_start_loc,
            extend_prefix_lens_cpu=forward_batch.extend_prefix_lens_cpu,
            extend_seq_lens_cpu=forward_batch.extend_seq_lens_cpu,
            extend_logprob_start_lens_cpu=forward_batch.extend_logprob_start_lens_cpu,
            extend_num_tokens=forward_batch.extend_num_tokens,
            extend_input_logprob_token_ids_gpu=forward_batch.extend_input_logprob_token_ids_gpu,
            positions=positions,
            global_num_tokens_gpu=forward_batch.global_num_tokens_gpu,
            global_num_tokens_for_logprob_gpu=forward_batch.global_num_tokens_for_logprob_gpu,
            dp_padding_mode=forward_batch.dp_padding_mode,
            global_dp_buffer_len=forward_batch.global_dp_buffer_len,
            mrope_positions=mrope_positions,
            spec_algorithm=forward_batch.spec_algorithm,
            spec_info=forward_batch.spec_info,
            capture_hidden_mode=forward_batch.capture_hidden_mode,
            num_token_non_padded=forward_batch.num_token_non_padded,
            global_forward_mode=forward_batch.global_forward_mode,
            lora_ids=forward_batch.lora_ids,
            sampling_info=forward_batch.sampling_info,
            mm_inputs=forward_batch.mm_inputs,
            temp_scaled_logprobs=forward_batch.temp_scaled_logprobs,
            temperature=forward_batch.temperature,
            top_p_normalized_logprobs=forward_batch.top_p_normalized_logprobs,
            top_p=forward_batch.top_p,
            input_deepstack_embeds=input_deepstack_embeds,
        )

        return static_forward_batch

</code_context>

<issue_to_address>
**issue (code-quality):** Inline variable that is immediately returned ([`inline-immediately-returned-variable`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/inline-immediately-returned-variable/))
</issue_to_address>

### Comment 7
<location> `python/sglang/srt/models/qwen2_5_vl.py:578-586` </location>
<code_context>
    def post_process(
        self,
        inputs_embeds,
        modalities: List[Modality],
        embeddings: List[torch.Tensor],
        indices: List[torch.Tensor],
        forward_batch: ForwardBatch,
    ) -> torch.Tensor:
        # Placeholder for post_process
        new_embeddings = []
        for i, (modality, embedding, index) in enumerate(
            zip(modalities, embeddings, indices)
        ):
            if embedding is None or index is None:
                continue

            new_embeddings.append(embedding)
        return new_embeddings, forward_batch

</code_context>

<issue_to_address>
**suggestion (code-quality):** We've found these issues:

- Convert for loop into list comprehension ([`list-comprehension`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/list-comprehension/))
- Remove unnecessary calls to `enumerate` when the index is not used ([`remove-unused-enumerate`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/remove-unused-enumerate/))
- Lift code into else after jump in control flow ([`reintroduce-else`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/reintroduce-else/))
- Remove redundant continue statement ([`remove-redundant-continue`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/remove-redundant-continue/))

```suggestion
        new_embeddings = [
            embedding
            for modality, embedding, index in zip(modalities, embeddings, indices)
            if embedding is not None and index is not None
        ]
```
</issue_to_address>

### Comment 8
<location> `test/srt/nightly/test_vlms_piecewise_cuda_graph.py:85` </location>
<code_context>
    def run_mmmu_eval(
        self,
        model_version: str,
        output_path: str,
        *,
        env: dict | None = None,
    ):
        """
        Evaluate a VLM on the MMMU validation set with lmms‑eval.
        Only `model_version` (checkpoint) and `chat_template` vary;
        We are focusing only on the validation set due to resource constraints.
        """
        # -------- fixed settings --------
        model = "openai_compatible"
        tp = 1
        tasks = "mmmu_val"
        batch_size = 32
        log_suffix = "openai_compatible"
        os.makedirs(output_path, exist_ok=True)

        # -------- compose --model_args --------
        model_args = f'model_version="{model_version}",' f"tp={tp}"

        # -------- build command list --------
        cmd = [
            "python3",
            "-m",
            "lmms_eval",
            "--model",
            model,
            "--model_args",
            model_args,
            "--tasks",
            tasks,
            "--batch_size",
            str(batch_size),
            "--output_path",
            str(output_path),
        ]

        subprocess.run(
            cmd,
            check=True,
            timeout=3600,
        )

</code_context>

<issue_to_address>
**suggestion (code-quality):** Remove unnecessary casts to int, str, float or bool ([`remove-unnecessary-cast`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/remove-unnecessary-cast/))

```suggestion
            output_path,
```
</issue_to_address>

### Comment 9
<location> `test/srt/nightly/test_vlms_piecewise_cuda_graph.py:123` </location>
<code_context>
    def _run_vlm_mmmu_test(
        self,
        model,
        output_path,
        test_name="",
        custom_env=None,
        log_level="info",
        capture_output=False,
    ):
        """
        Common method to run VLM MMMU benchmark test.
        Args:
            model: Model to test
            output_path: Path for output logs
            test_name: Optional test name for logging
            custom_env: Optional custom environment variables
            log_level: Log level for server (default: "info")
            capture_output: Whether to capture server stdout/stderr
        """
        print(f"\nTesting model: {model.model}{test_name}")

        process = None
        mmmu_accuracy = 0  # Initialize to handle potential exceptions
        server_output = ""

        try:
            # Prepare environment variables
            process_env = os.environ.copy()
            if custom_env:
                process_env.update(custom_env)
            # if test vlm with cuda_ipc feature, open this env_var
            process_env["SGLANG_USE_CUDA_IPC_TRANSPORT"] = "1"

            # Prepare stdout/stderr redirection if needed
            stdout_file = None
            stderr_file = None
            if capture_output:
                stdout_file = open("/tmp/server_stdout.log", "w")
                stderr_file = open("/tmp/server_stderr.log", "w")

            # Launch server for testing
            process = popen_launch_server(
                model.model,
                base_url=self.base_url,
                timeout=self.time_out,
                api_key=self.api_key,
                other_args=[
                    "--trust-remote-code",
                    "--piecewise-cuda-graph-max-tokens",
                    "8192",
                    "--enable-piecewise-cuda-graph",
                    "--tp=8",
                    "--piecewise-cuda-graph-compiler=eager",
                    "--disable-radix-cache",
                    "--log-level",
                    log_level,
                ],
                env=process_env,
                return_stdout_stderr=(
                    (stdout_file, stderr_file) if capture_output else None
                ),
            )

            # Run evaluation
            self.run_mmmu_eval(model.model, output_path)

            # Get the result file
            # Search recursively for JSON result files (lmms-eval v0.4.1+ creates subdirectories)
            result_files = glob.glob(f"{output_path}/**/*.json", recursive=True)
            if not result_files:
                result_files = glob.glob(f"{output_path}/*.json")

            if not result_files:
                raise FileNotFoundError(f"No JSON result files found in {output_path}")

            result_file_path = result_files[0]

            with open(result_file_path, "r") as f:
                result = json.load(f)
                print(f"Result{test_name}\n: {result}")

            # Process the result
            mmmu_accuracy = result["results"]["mmmu_val"]["mmmu_acc,none"]
            print(
                f"Model {model.model} achieved accuracy{test_name}: {mmmu_accuracy:.4f}"
            )

            # Capture server output if requested
            if capture_output and process:
                server_output = self._read_output_from_files()

            # Assert performance meets expected threshold
            self.assertGreaterEqual(
                mmmu_accuracy,
                model.mmmu_accuracy,
                f"Model {model.model} accuracy ({mmmu_accuracy:.4f}) below expected threshold ({model.mmmu_accuracy:.4f}){test_name}",
            )

            return server_output

        except Exception as e:
            print(f"Error testing {model.model}{test_name}: {e}")
            self.fail(f"Test failed for {model.model}{test_name}: {e}")

        finally:
            # Ensure process cleanup happens regardless of success/failure
            if process is not None and process.poll() is None:
                print(f"Cleaning up process {process.pid}")
                try:
                    kill_process_tree(process.pid)
                except Exception as e:
                    print(f"Error killing process: {e}")

            # clean up temporary files
            if capture_output:
                if stdout_file:
                    stdout_file.close()
                if stderr_file:
                    stderr_file.close()
                for filename in ["/tmp/server_stdout.log", "/tmp/server_stderr.log"]:
                    try:
                        if os.path.exists(filename):
                            os.remove(filename)
                    except Exception as e:
                        print(f"Error removing {filename}: {e}")

</code_context>

<issue_to_address>
**suggestion (code-quality):** We've found these issues:

- Merge dictionary updates via the union operator ([`dict-assign-update-to-union`](https://docs.sourcery.ai/Reference/Default-Rules/suggestions/dict-assign-update-to-union/))
- Low code quality found in TestVLMPiecewiseCudaGraph._run_vlm_mmmu_test - 21% ([`low-code-quality`](https://docs.sourcery.ai/Reference/Default-Rules/comments/low-code-quality/))

```suggestion
                process_env |= custom_env
```

<br/><details><summary>Explanation</summary>
The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

- Reduce the function length by extracting pieces of functionality out into
  their own functions. This is the most important thing you can do - ideally a
  function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
  sits together within the function rather than being scattered.</details>
</issue_to_address>

### Comment 10
<location> `test/srt/nightly/test_vlms_piecewise_cuda_graph.py:231-232` </location>
<code_context>
    def _read_output_from_files(self):
        output_lines = []

        log_files = [
            ("/tmp/server_stdout.log", "[STDOUT]"),
            ("/tmp/server_stderr.log", "[STDERR]"),
        ]
        for filename, tag in log_files:
            try:
                if os.path.exists(filename):
                    with open(filename, "r") as f:
                        for line in f:
                            output_lines.append(f"{tag} {line.rstrip()}")
            except Exception as e:
                print(f"Error reading {tag.lower()} file: {e}")

        return "\n".join(output_lines)

</code_context>

<issue_to_address>
**suggestion (code-quality):** Replace a for append loop with list extend ([`for-append-to-extend`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/for-append-to-extend/))

```suggestion
                        output_lines.extend(f"{tag} {line.rstrip()}" for line in f)
```
</issue_to_address>


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for piecewise CUDA graph execution for Qwen2.5-VL and other multimodal models, which is a significant feature. The approach of externalizing the multimodal preprocessing and using a model-specific post_process method is a solid design. My review includes a few suggestions to enhance the design's extensibility, correct a type hint, and clean up some minor code issues.

Comment on lines +171 to +172
        # The following is just for qwen3vl, maybe not ideal to place it here.
        self.use_deepstack = getattr(model_runner.model, "use_deepstack", False)

Severity: high

The comment correctly points out that this Qwen3-VL-specific logic is not ideal here. A generic runner like PiecewiseCudaGraphRunner should not contain logic tied to a specific model. This makes the code harder to maintain and extend.

Consider moving this logic or making it more generic. For example, the runner could check for the presence of use_deepstack and deepstack_visual_indexes attributes on the model object and derive the necessary configuration from there, rather than being hardcoded for a specific model family.
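One hedged sketch of that feature detection (illustrative only; the helper name is hypothetical, and `use_deepstack` / `deepstack_visual_indexes` are the attributes named in this review, not a confirmed SGLang interface):

```python
def _resolve_deepstack_config(model) -> tuple[bool, int]:
    # Any model exposing these attributes opts in; no model-family checks needed.
    use_deepstack = bool(getattr(model, "use_deepstack", False))
    visual_indexes = getattr(model, "deepstack_visual_indexes", None)
    num_deepstack_layers = (
        len(visual_indexes) if use_deepstack and visual_indexes is not None else 0
    )
    return use_deepstack, num_deepstack_layers
```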

Comment on lines +820 to +853
def should_use_external_mm_preprocess(multimodal_model: nn.Module) -> bool:
    """Decide whether we should use our generic "multimodal_preprocess_routine".
    We are adapting VLM for piecewise CUDA graph. Since the encoder's forward
    pass cannot be executed within the model's forward pass, we need to
    precompute image embeddings using the encoder within the model runner.
    For models that have already been adjusted, there is a member called
    should_use_external_mm_preprocess, which is set to True. In practice,
    the multimodal_preprocess_routine function will be called in the
    model_runner.forward_extend to handle multimodal inputs.
    For models that have not yet been adapted, the general_mm_embed_routine
    will still be called in the model class's forward function for processing.
    Current strategy:
    - Llava family (models with vision_tower + multi_modal_projector):
      Their forward already calls general_mm_embed_routine and includes
      built-in multimodal processing. If we run it again in ModelRunner,
      it will conflict with the internal logic, so we skip it here.
    - Others (such as Qwen2-VL / Qwen2.5-VL / Qwen3-VL): use the
      multimodal preprocessing.
    """

    cls_name = multimodal_model.__class__.__name__

    qwen_vl_classes = {
        "Qwen2VLForConditionalGeneration",
        "Qwen2_5_VLForConditionalGeneration",
        "Qwen3VLForConditionalGeneration",
        "Qwen3VLMoeForConditionalGeneration",
        "Qwen3OmniMoeForConditionalGeneration",
    }

    return cls_name in qwen_vl_classes

Severity: medium

Using a hardcoded set of class names in should_use_external_mm_preprocess makes it difficult to extend this functionality to new models in the future. A more robust and maintainable approach would be to use feature detection on the model object itself. For instance, you could check for the existence of a specific attribute (e.g., model.supports_external_mm_preprocess = True) or a method (e.g., hasattr(model, "post_process")). This would make the logic more generic and decoupled from specific model implementations.
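For illustration, a minimal version of that feature-detection approach might look like the following (the capability flag and the reliance on a `post_process` hook follow the suggestion above and are hypothetical, not existing SGLang API):

```python
def should_use_external_mm_preprocess(multimodal_model) -> bool:
    # Opt in via an explicit capability flag, or via the presence of the
    # post_process hook that the new external preprocessing path relies on.
    return bool(
        getattr(multimodal_model, "supports_external_mm_preprocess", False)
        or hasattr(multimodal_model, "post_process")
    )
```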

    data_embedding_funcs: Dict[
        Modality, Callable[[List[MultimodalDataItem]], torch.Tensor]
    ] = None,
) -> torch.Tensor:

Severity: medium

The return type hint for this function is torch.Tensor, but it actually returns a ForwardBatch object on lines 874 and 939. Please update the type hint to -> ForwardBatch: for correctness and clarity.

Suggested change
) -> torch.Tensor:
) -> ForwardBatch:

        num_tokens = len(forward_batch.input_ids)

        if self.use_deepstack:
            self.input_deepstack_embeds.zero_()  # may be removed.

Severity: medium

The comment may be removed is unclear. If zeroing out self.input_deepstack_embeds is necessary to prevent stale data in the CUDA graph, the comment should state that for clarity. If it's truly optional, please remove the line and the comment to avoid confusion.

Suggested change
self.input_deepstack_embeds.zero_() # may be removed.
self.input_deepstack_embeds.zero_() # Clear stale data for CUDA graph correctness.

    ) -> torch.Tensor:
        if not self.use_deepstack:
            return embeddings, forward_batch
        deepstack_embeddings = []

Severity: medium

The variable deepstack_embeddings is initialized but never used. It can be removed to improve code clarity.
