[DO NOT MERGE] Support Piecewise CUDA Graph for Qwen2.5-VL #17
base: main
Conversation
Co-authored-by: Yuhao Yang <[email protected]>
Co-authored-by: Yuan Luo <[email protected]>
Co-authored-by: yhyang201 <[email protected]>
Reviewer's Guide

This PR refactors the multimodal embedding pipeline and extends both piecewise and full CUDA graph runners to support Qwen2.5-VL by introducing external preprocessing of multimodal inputs, passing precomputed input_embeds (and deepstack embeddings) through graph capture and replay, updating model wrappers to consume these tensors, and adding targeted tests for the new flow.

Sequence diagram for external multimodal preprocessing and input_embeds flow:

```mermaid
sequenceDiagram
    participant ModelRunner
    participant "multimodal_preprocess_routine()"
    participant "embed_mm_inputs()"
    participant Model
    participant CUDA_Graph_Runner
    ModelRunner->>"multimodal_preprocess_routine()": If is_multimodal, preprocess batch
    "multimodal_preprocess_routine()"->>"embed_mm_inputs()": Compute input_embeds, deepstack_embeds
    "embed_mm_inputs()"-->>"multimodal_preprocess_routine()": Return input_embeds, update forward_batch
    "multimodal_preprocess_routine()"-->>ModelRunner: Return updated forward_batch
    ModelRunner->>CUDA_Graph_Runner: Pass forward_batch with input_embeds
    CUDA_Graph_Runner->>Model: Call forward(input_embeds, ...)
    Model-->>CUDA_Graph_Runner: Return hidden_states
```
Class diagram for updated ForwardBatch and model wrappers:

```mermaid
classDiagram
    class ForwardBatch {
        +torch.Tensor input_ids
        +torch.Tensor input_embeds
        +torch.Tensor mrope_positions
        +Optional[torch.Tensor] input_deepstack_embeds
        ...
    }
    class Qwen2VLForConditionalGeneration {
        +forward(input_ids, positions, forward_batch, input_embeds=None, get_embedding=False)
    }
    class Qwen2_5_VLForConditionalGeneration {
        +forward(input_ids, positions, forward_batch, input_embeds=None, get_embedding=False, pp_proxy_tensors=None)
        +post_process(inputs_embeds, modalities, embeddings, indices, forward_batch)
    }
    class Qwen3VLForConditionalGeneration {
        +forward(input_ids, positions, forward_batch, input_embeds=None, get_embedding=False)
        +post_process(inputs_embeds, modalities, embeddings, indices, forward_batch)
    }
    class Qwen3OmniMoeForConditionalGeneration {
        +get_audio_feature()
        +get_image_feature()
        +get_video_feature()
        +get_input_embeddings()
        +post_process()
    }
    Qwen3OmniMoeForConditionalGeneration --> Qwen3VLForConditionalGeneration : delegates methods
```
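To make the data flow above concrete, here is a minimal, self-contained sketch of the external preprocessing step. `ToyForwardBatch` and `toy_embed_fn` are illustrative stand-ins; only the routine and field names mirror the diagrams, not the PR's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

import torch


@dataclass
class ToyForwardBatch:
    """Stripped-down stand-in for ForwardBatch with only the fields shown above."""

    input_ids: torch.Tensor
    is_multimodal: bool = False
    input_embeds: Optional[torch.Tensor] = None
    input_deepstack_embeds: Optional[torch.Tensor] = None


def multimodal_preprocess_routine(batch: ToyForwardBatch, embed_fn) -> ToyForwardBatch:
    # Run the (vision) encoder outside the graph-captured region and stash the
    # result on the batch, so the graph runner only ever sees dense tensors.
    if batch.is_multimodal:
        batch.input_embeds, batch.input_deepstack_embeds = embed_fn(batch)
    return batch


def toy_embed_fn(batch: ToyForwardBatch) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
    # Stand-in for embed_mm_inputs(): fake an embedding lookup per input token.
    return torch.randn(batch.input_ids.shape[0], 8), None


batch = ToyForwardBatch(input_ids=torch.arange(4), is_multimodal=True)
batch = multimodal_preprocess_routine(batch, toy_embed_fn)
print(batch.input_embeds.shape)  # torch.Size([4, 8])
```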
Summary of Changes

Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request advances the integration of piecewise CUDA graph functionality for several Qwen-VL models, including Qwen2.5-VL, Qwen2-VL, and Qwen3-VL. The core of the change is a refactoring of multimodal input processing that moves embedding generation to an external step, so that the language model's forward pass can be captured by CUDA graphs. This approach is designed to boost performance by reducing overhead for repetitive computations. The update also adds specialized support for Qwen3-VL's 'deepstack' embeddings and extends the CUDA graph infrastructure to accept input embeddings directly. The functionality is validated through new test suites.
Highlights
Hey there - I've reviewed your changes - here's some feedback:
- There’s a lot of duplicated multimodal post-processing and forward logic across qwen2_vl, qwen2_5_vl, and qwen3_vl—consider refactoring this into a shared base or helper to reduce code duplication and improve consistency.
- The embed_mm_inputs/general_mm_embed_routine signatures have grown with many new parameters; consider encapsulating these into a config object or dataclass to improve readability and avoid mismatched arguments (a rough sketch follows this list).
- The new nightly integration test for piecewise CUDA graph is quite large and may significantly increase CI runtime—consider splitting it into smaller focused tests or marking it as a slow suite to avoid blocking regular PR feedback.
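A rough sketch of the config-object idea from the second bullet above; the class and most field names are hypothetical (only `skip_llm_forward` and `input_deepstack_embeds` appear elsewhere in this PR):

```python
from dataclasses import dataclass
from typing import Optional

import torch
from torch import nn


@dataclass
class MMEmbedConfig:
    """Bundles arguments that are currently threaded through the embedding
    routines as individual parameters."""

    multimodal_model: nn.Module
    skip_llm_forward: bool = False
    use_deepstack: bool = False
    input_deepstack_embeds: Optional[torch.Tensor] = None


def embed_mm_inputs_from_config(forward_batch, config: MMEmbedConfig):
    # Callers build one explicit config object instead of a long positional list,
    # which makes mismatched or forgotten arguments much harder to write.
    ...
```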
## Individual Comments
### Comment 1
<location> `python/sglang/srt/managers/mm_utils.py:681-682` </location>
<code_context>
else:
inputs_embeds = None
+ if skip_llm_forward:
+ return inputs_embeds
+
hidden_states = language_model(
</code_context>
<issue_to_address>
**issue (bug_risk):** Returning only 'inputs_embeds' when 'skip_llm_forward' is True may break downstream code expecting a tuple.
To prevent runtime errors, ensure the return type remains consistent, or update documentation to reflect the change.
</issue_to_address>
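One hypothetical way to keep the return shape uniform across both paths (a sketch only; the routine name, arguments, and result type below are illustrative, not the actual general_mm_embed_routine API):

```python
from typing import NamedTuple, Optional

import torch


class MMEmbedResult(NamedTuple):
    """One return shape for every code path of the embedding routine."""

    inputs_embeds: Optional[torch.Tensor]
    hidden_states: Optional[torch.Tensor]


def mm_embed_routine(inputs_embeds, language_model, skip_llm_forward=False):
    if skip_llm_forward:
        # Callers still receive the same tuple shape; hidden_states is simply None.
        return MMEmbedResult(inputs_embeds=inputs_embeds, hidden_states=None)
    hidden_states = language_model(inputs_embeds)
    return MMEmbedResult(inputs_embeds=inputs_embeds, hidden_states=hidden_states)
```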
### Comment 2
<location> `python/sglang/srt/model_executor/piecewise_cuda_graph_runner.py:241-243` </location>
<code_context>
torch.randint(0, 100, (num_tokens,), device=self.device)
if not self.use_input_embeds
else None
</code_context>
<issue_to_address>
**suggestion (code-quality):** Swap if/else branches of if expression to remove negation ([`swap-if-expression`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/swap-if-expression))
```suggestion
None if self.use_input_embeds else torch.randint(0, 100, (num_tokens,), device=self.device)
```
<br/><details><summary>Explanation</summary>Negated conditions are more difficult to read than positive ones, so it is best
to avoid them where we can. By swapping the `if` and `else` conditions around we
can invert the condition and make it positive.
</details>
</issue_to_address>
### Comment 3
<location> `test/srt/nightly/test_vlms_piecewise_cuda_graph.py:162-164` </location>
<code_context>
result_files = glob.glob(f"{output_path}/**/*.json", recursive=True)
if not result_files:
result_files = glob.glob(f"{output_path}/*.json")
</code_context>
<issue_to_address>
**suggestion (code-quality):** Use `or` for providing a fallback value ([`use-or-for-fallback`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/use-or-for-fallback))
```suggestion
result_files = glob.glob(f"{output_path}/**/*.json", recursive=True) or glob.glob(f"{output_path}/*.json")
```
<br/><details><summary>Explanation</summary>Thanks to the flexibility of Python's `or` operator, you can use a single
assignment statement, even if a variable can retrieve its value from various
sources. This is shorter and easier to read than using multiple assignments with
`if not` conditions.
</details>
</issue_to_address>
### Comment 4
<location> `test/srt/nightly/test_vlms_piecewise_cuda_graph.py:242-243` </location>
<code_context>
</code_context>
<issue_to_address>
**issue (code-quality):** Avoid conditionals in tests. ([`no-conditionals-in-tests`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/no-conditionals-in-tests))
<details><summary>Explanation</summary>Avoid complex code, like conditionals, in test functions.
Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
* loops
* conditionals
Some ways to fix this:
* Use parametrized tests to get rid of the loop.
* Move the complex logic into helpers.
* Move the complex part into pytest fixtures.
> Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / [Don't Put Logic in Tests](https://abseil.io/resources/swe-book/html/ch12.html#donapostrophet_put_logic_in_tests)
</details>
</issue_to_address>
### Comment 5
<location> `test/srt/nightly/test_vlms_piecewise_cuda_graph.py:245-246` </location>
<code_context>
</code_context>
<issue_to_address>
**issue (code-quality):** Avoid loops in tests. ([`no-loop-in-tests`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/no-loop-in-tests))
<details><summary>Explanation</summary>Avoid complex code, like loops, in test functions.
Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
* loops
* conditionals
Some ways to fix this:
* Use parametrized tests to get rid of the loop.
* Move the complex logic into helpers.
* Move the complex part into pytest fixtures.
> Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / [Don't Put Logic in Tests](https://abseil.io/resources/swe-book/html/ch12.html#donapostrophet_put_logic_in_tests)
</details>
</issue_to_address>
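One way to tame the loop while staying in the unittest style the suite already uses is `subTest` (a sketch; the helper and the `MODELS` entries are illustrative, and pytest's `@pytest.mark.parametrize` would remove the loop entirely):

```python
import unittest

MODELS = ["Qwen/Qwen2.5-VL-7B-Instruct"]  # placeholder entries for illustration


class TestVLMPiecewiseCudaGraph(unittest.TestCase):
    def _run_single_model(self, model_path: str) -> None:
        # Launch the server, run the MMMU eval, and assert on accuracy for one model.
        ...

    def test_models(self) -> None:
        for model_path in MODELS:
            # subTest isolates each model's failure report; the loop body stays a
            # single helper call with no branching.
            with self.subTest(model=model_path):
                self._run_single_model(model_path)
```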
### Comment 6
<location> `python/sglang/srt/model_executor/piecewise_cuda_graph_runner.py:554` </location>
<code_context>
def replay_prepare(
self,
forward_batch: ForwardBatch,
**kwargs,
):
if self.use_input_embeds:
num_tokens = forward_batch.input_embeds.shape[0]
else:
num_tokens = len(forward_batch.input_ids)
if self.use_deepstack:
self.input_deepstack_embeds.zero_() # may be removed.
index = bisect.bisect_left(self.capture_num_tokens, num_tokens)
static_num_tokens = self.capture_num_tokens[index]
self.raw_num_tokens = num_tokens
if static_num_tokens != num_tokens:
self.out_cache_loc.zero_()
self.out_cache_loc_swa.zero_()
bs = forward_batch.batch_size
if self.use_input_embeds:
self.input_embeds[:num_tokens].copy_(forward_batch.input_embeds)
else:
self.input_ids[:num_tokens].copy_(forward_batch.input_ids)
self.positions[:num_tokens].copy_(forward_batch.positions)
self.out_cache_loc[:num_tokens].copy_(forward_batch.out_cache_loc)
if forward_batch.out_cache_loc_swa is not None:
self.out_cache_loc_swa[:num_tokens].copy_(forward_batch.out_cache_loc_swa)
input_ids = self.input_ids[:static_num_tokens]
positions = self.positions[:static_num_tokens]
out_cache_loc = self.out_cache_loc[:static_num_tokens]
out_cache_loc_swa = (
self.out_cache_loc_swa[:static_num_tokens]
if forward_batch.out_cache_loc_swa is not None
else None
)
if forward_batch.mrope_positions is not None:
self.mrope_positions[:, :num_tokens].copy_(forward_batch.mrope_positions)
if self.use_input_embeds:
input_ids = None
input_embeds = self.input_embeds[:static_num_tokens]
else:
input_ids = self.input_ids[:static_num_tokens]
input_embeds = None
positions = self.positions[:static_num_tokens]
out_cache_loc = self.out_cache_loc[:static_num_tokens]
mrope_positions = (
self.mrope_positions[:, :static_num_tokens]
if forward_batch.mrope_positions is not None
else None
)
next_token_logits_buffer = None
input_deepstack_embeds = None
if self.use_deepstack:
self.input_deepstack_embeds[:num_tokens].copy_(
forward_batch.input_deepstack_embeds
)
input_deepstack_embeds = self.input_deepstack_embeds[:static_num_tokens]
static_forward_batch = ForwardBatch(
forward_mode=forward_batch.forward_mode,
batch_size=bs,
input_ids=input_ids,
input_embeds=input_embeds,
req_pool_indices=forward_batch.req_pool_indices,
seq_lens=forward_batch.seq_lens,
next_token_logits_buffer=next_token_logits_buffer,
orig_seq_lens=forward_batch.orig_seq_lens,
seq_lens_cpu=forward_batch.seq_lens_cpu,
req_to_token_pool=self.model_runner.req_to_token_pool,
token_to_kv_pool=self.model_runner.token_to_kv_pool,
attn_backend=self.model_runner.attn_backend,
out_cache_loc=out_cache_loc,
out_cache_loc_swa=out_cache_loc_swa,
seq_lens_sum=forward_batch.seq_lens_sum,
encoder_lens=forward_batch.encoder_lens,
return_logprob=False,
extend_seq_lens=forward_batch.extend_seq_lens,
extend_prefix_lens=forward_batch.extend_prefix_lens,
extend_start_loc=forward_batch.extend_start_loc,
extend_prefix_lens_cpu=forward_batch.extend_prefix_lens_cpu,
extend_seq_lens_cpu=forward_batch.extend_seq_lens_cpu,
extend_logprob_start_lens_cpu=forward_batch.extend_logprob_start_lens_cpu,
extend_num_tokens=forward_batch.extend_num_tokens,
extend_input_logprob_token_ids_gpu=forward_batch.extend_input_logprob_token_ids_gpu,
positions=positions,
global_num_tokens_gpu=forward_batch.global_num_tokens_gpu,
global_num_tokens_for_logprob_gpu=forward_batch.global_num_tokens_for_logprob_gpu,
dp_padding_mode=forward_batch.dp_padding_mode,
global_dp_buffer_len=forward_batch.global_dp_buffer_len,
mrope_positions=mrope_positions,
spec_algorithm=forward_batch.spec_algorithm,
spec_info=forward_batch.spec_info,
capture_hidden_mode=forward_batch.capture_hidden_mode,
num_token_non_padded=forward_batch.num_token_non_padded,
global_forward_mode=forward_batch.global_forward_mode,
lora_ids=forward_batch.lora_ids,
sampling_info=forward_batch.sampling_info,
mm_inputs=forward_batch.mm_inputs,
temp_scaled_logprobs=forward_batch.temp_scaled_logprobs,
temperature=forward_batch.temperature,
top_p_normalized_logprobs=forward_batch.top_p_normalized_logprobs,
top_p=forward_batch.top_p,
input_deepstack_embeds=input_deepstack_embeds,
)
return static_forward_batch
</code_context>
<issue_to_address>
**issue (code-quality):** Inline variable that is immediately returned ([`inline-immediately-returned-variable`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/inline-immediately-returned-variable/))
</issue_to_address>
### Comment 7
<location> `python/sglang/srt/models/qwen2_5_vl.py:578-586` </location>
<code_context>
def post_process(
self,
inputs_embeds,
modalities: List[Modality],
embeddings: List[torch.Tensor],
indices: List[torch.Tensor],
forward_batch: ForwardBatch,
) -> torch.Tensor:
# Placeholder for post_process
new_embeddings = []
for i, (modality, embedding, index) in enumerate(
zip(modalities, embeddings, indices)
):
if embedding is None or index is None:
continue
new_embeddings.append(embedding)
return new_embeddings, forward_batch
</code_context>
<issue_to_address>
**suggestion (code-quality):** We've found these issues:
- Convert for loop into list comprehension ([`list-comprehension`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/list-comprehension/))
- Remove unnecessary calls to `enumerate` when the index is not used ([`remove-unused-enumerate`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/remove-unused-enumerate/))
- Lift code into else after jump in control flow ([`reintroduce-else`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/reintroduce-else/))
- Remove redundant continue statement ([`remove-redundant-continue`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/remove-redundant-continue/))
```suggestion
new_embeddings = [
embedding
for modality, embedding, index in zip(modalities, embeddings, indices)
if embedding is not None and index is not None
]
```
</issue_to_address>
### Comment 8
<location> `test/srt/nightly/test_vlms_piecewise_cuda_graph.py:85` </location>
<code_context>
def run_mmmu_eval(
self,
model_version: str,
output_path: str,
*,
env: dict | None = None,
):
"""
Evaluate a VLM on the MMMU validation set with lmms‑eval.
Only `model_version` (checkpoint) and `chat_template` vary;
We are focusing only on the validation set due to resource constraints.
"""
# -------- fixed settings --------
model = "openai_compatible"
tp = 1
tasks = "mmmu_val"
batch_size = 32
log_suffix = "openai_compatible"
os.makedirs(output_path, exist_ok=True)
# -------- compose --model_args --------
model_args = f'model_version="{model_version}",' f"tp={tp}"
# -------- build command list --------
cmd = [
"python3",
"-m",
"lmms_eval",
"--model",
model,
"--model_args",
model_args,
"--tasks",
tasks,
"--batch_size",
str(batch_size),
"--output_path",
str(output_path),
]
subprocess.run(
cmd,
check=True,
timeout=3600,
)
</code_context>
<issue_to_address>
**suggestion (code-quality):** Remove unnecessary casts to int, str, float or bool ([`remove-unnecessary-cast`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/remove-unnecessary-cast/))
```suggestion
output_path,
```
</issue_to_address>
### Comment 9
<location> `test/srt/nightly/test_vlms_piecewise_cuda_graph.py:123` </location>
<code_context>
def _run_vlm_mmmu_test(
self,
model,
output_path,
test_name="",
custom_env=None,
log_level="info",
capture_output=False,
):
"""
Common method to run VLM MMMU benchmark test.
Args:
model: Model to test
output_path: Path for output logs
test_name: Optional test name for logging
custom_env: Optional custom environment variables
log_level: Log level for server (default: "info")
capture_output: Whether to capture server stdout/stderr
"""
print(f"\nTesting model: {model.model}{test_name}")
process = None
mmmu_accuracy = 0 # Initialize to handle potential exceptions
server_output = ""
try:
# Prepare environment variables
process_env = os.environ.copy()
if custom_env:
process_env.update(custom_env)
# if test vlm with cuda_ipc feature, open this env_var
process_env["SGLANG_USE_CUDA_IPC_TRANSPORT"] = "1"
# Prepare stdout/stderr redirection if needed
stdout_file = None
stderr_file = None
if capture_output:
stdout_file = open("/tmp/server_stdout.log", "w")
stderr_file = open("/tmp/server_stderr.log", "w")
# Launch server for testing
process = popen_launch_server(
model.model,
base_url=self.base_url,
timeout=self.time_out,
api_key=self.api_key,
other_args=[
"--trust-remote-code",
"--piecewise-cuda-graph-max-tokens",
"8192",
"--enable-piecewise-cuda-graph",
"--tp=8",
"--piecewise-cuda-graph-compiler=eager",
"--disable-radix-cache",
"--log-level",
log_level,
],
env=process_env,
return_stdout_stderr=(
(stdout_file, stderr_file) if capture_output else None
),
)
# Run evaluation
self.run_mmmu_eval(model.model, output_path)
# Get the result file
# Search recursively for JSON result files (lmms-eval v0.4.1+ creates subdirectories)
result_files = glob.glob(f"{output_path}/**/*.json", recursive=True)
if not result_files:
result_files = glob.glob(f"{output_path}/*.json")
if not result_files:
raise FileNotFoundError(f"No JSON result files found in {output_path}")
result_file_path = result_files[0]
with open(result_file_path, "r") as f:
result = json.load(f)
print(f"Result{test_name}\n: {result}")
# Process the result
mmmu_accuracy = result["results"]["mmmu_val"]["mmmu_acc,none"]
print(
f"Model {model.model} achieved accuracy{test_name}: {mmmu_accuracy:.4f}"
)
# Capture server output if requested
if capture_output and process:
server_output = self._read_output_from_files()
# Assert performance meets expected threshold
self.assertGreaterEqual(
mmmu_accuracy,
model.mmmu_accuracy,
f"Model {model.model} accuracy ({mmmu_accuracy:.4f}) below expected threshold ({model.mmmu_accuracy:.4f}){test_name}",
)
return server_output
except Exception as e:
print(f"Error testing {model.model}{test_name}: {e}")
self.fail(f"Test failed for {model.model}{test_name}: {e}")
finally:
# Ensure process cleanup happens regardless of success/failure
if process is not None and process.poll() is None:
print(f"Cleaning up process {process.pid}")
try:
kill_process_tree(process.pid)
except Exception as e:
print(f"Error killing process: {e}")
# clean up temporary files
if capture_output:
if stdout_file:
stdout_file.close()
if stderr_file:
stderr_file.close()
for filename in ["/tmp/server_stdout.log", "/tmp/server_stderr.log"]:
try:
if os.path.exists(filename):
os.remove(filename)
except Exception as e:
print(f"Error removing {filename}: {e}")
</code_context>
<issue_to_address>
**suggestion (code-quality):** We've found these issues:
- Merge dictionary updates via the union operator ([`dict-assign-update-to-union`](https://docs.sourcery.ai/Reference/Default-Rules/suggestions/dict-assign-update-to-union/))
- Low code quality found in TestVLMPiecewiseCudaGraph.\_run\_vlm\_mmmu\_test - 21% ([`low-code-quality`](https://docs.sourcery.ai/Reference/Default-Rules/comments/low-code-quality/))
```suggestion
process_env |= custom_env
```
<br/><details><summary>Explanation</summary>
The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.
How can you solve this?
It might be worth refactoring this function to make it shorter and more readable.
- Reduce the function length by extracting pieces of functionality out into
their own functions. This is the most important thing you can do - ideally a
function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
sits together within the function rather than being scattered.</details>
</issue_to_address>
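For instance, the result-file parsing inside `_run_vlm_mmmu_test` could be pulled into a small helper along these lines (a sketch; the helper name is made up, while the glob patterns and JSON keys come from the code above):

```python
import glob
import json


def load_mmmu_accuracy(output_path: str) -> float:
    """Locate the lmms-eval JSON output and extract the MMMU validation accuracy."""
    result_files = glob.glob(f"{output_path}/**/*.json", recursive=True) or glob.glob(
        f"{output_path}/*.json"
    )
    if not result_files:
        raise FileNotFoundError(f"No JSON result files found in {output_path}")
    with open(result_files[0]) as f:
        result = json.load(f)
    return result["results"]["mmmu_val"]["mmmu_acc,none"]
```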
### Comment 10
<location> `test/srt/nightly/test_vlms_piecewise_cuda_graph.py:231-232` </location>
<code_context>
def _read_output_from_files(self):
output_lines = []
log_files = [
("/tmp/server_stdout.log", "[STDOUT]"),
("/tmp/server_stderr.log", "[STDERR]"),
]
for filename, tag in log_files:
try:
if os.path.exists(filename):
with open(filename, "r") as f:
for line in f:
output_lines.append(f"{tag} {line.rstrip()}")
except Exception as e:
print(f"Error reading {tag.lower()} file: {e}")
return "\n".join(output_lines)
</code_context>
<issue_to_address>
**suggestion (code-quality):** Replace a for append loop with list extend ([`for-append-to-extend`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/for-append-to-extend/))
```suggestion
output_lines.extend(f"{tag} {line.rstrip()}" for line in f)
```
</issue_to_address>
Code Review
This pull request introduces support for piecewise CUDA graph execution for Qwen2.5-VL and other multimodal models, which is a significant feature. The approach of externalizing the multimodal preprocessing and using a model-specific post_process method is a solid design. My review includes a few suggestions to enhance the design's extensibility, correct a type hint, and clean up some minor code issues.
```python
        # The following is just for qwen3vl, maybe not ideal to place it here.
        self.use_deepstack = getattr(model_runner.model, "use_deepstack", False)
```
The comment correctly points out that this Qwen3-VL-specific logic is not ideal here. A generic runner like PiecewiseCudaGraphRunner should not contain logic tied to a specific model. This makes the code harder to maintain and extend.
Consider moving this logic or making it more generic. For example, the runner could check for the presence of use_deepstack and deepstack_visual_indexes attributes on the model object and derive the necessary configuration from there, rather than being hardcoded for a specific model family.
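A sketch of that attribute-driven alternative (the attribute names come from the suggestion above; the helper itself and the buffer-sizing formula are hypothetical):

```python
from torch import nn


def derive_deepstack_config(model: nn.Module, hidden_size: int):
    """Derive deepstack buffer sizing from attributes the model itself exposes,
    instead of hardcoding the Qwen3-VL family inside the graph runner."""
    if not getattr(model, "use_deepstack", False):
        return False, 0
    # Models that opt in are expected to expose which visual layers feed deepstack.
    indexes = getattr(model, "deepstack_visual_indexes", ())
    return True, len(indexes) * hidden_size
```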
```python
def should_use_external_mm_preprocess(multimodal_model: nn.Module) -> bool:
    """Decide whether we should use our generic "multimodal_preprocess_routine".

    We are adapting VLM for piecewise CUDA graph. Since the encoder's forward
    pass cannot be executed within the model's forward pass, we need to
    precompute image embeddings using the encoder within the model runner.
    For models that have already been adjusted, there is a member called
    should_use_external_mm_preprocess, which is set to True. In practice,
    the multimodal_preprocess_routine function will be called in the
    model_runner.forward_extend to handle multimodal inputs.
    For models that have not yet been adapted, the general_mm_embed_routine
    will still be called in the model class's forward function for processing.

    Current strategy:
    - Llava family (models with vision_tower + multi_modal_projector):
      Their forward already calls general_mm_embed_routine and includes
      built-in multimodal processing. If we run it again in ModelRunner,
      it will conflict with the internal logic, so we skip it here.
    - Others (such as Qwen2-VL / Qwen2.5-VL / Qwen3-VL): use the
      multimodal preprocessing.
    """
    cls_name = multimodal_model.__class__.__name__

    qwen_vl_classes = {
        "Qwen2VLForConditionalGeneration",
        "Qwen2_5_VLForConditionalGeneration",
        "Qwen3VLForConditionalGeneration",
        "Qwen3VLMoeForConditionalGeneration",
        "Qwen3OmniMoeForConditionalGeneration",
    }

    return cls_name in qwen_vl_classes
```
Using a hardcoded set of class names in should_use_external_mm_preprocess makes it difficult to extend this functionality to new models in the future. A more robust and maintainable approach would be to use feature detection on the model object itself. For instance, you could check for the existence of a specific attribute (e.g., model.supports_external_mm_preprocess = True) or a method (e.g., hasattr(model, "post_process")). This would make the logic more generic and decoupled from specific model implementations.
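A sketch of that feature-detection approach (the `supports_external_mm_preprocess` flag is hypothetical, mirroring the suggestion above; the `post_process` check relies on the hook this PR adds):

```python
from torch import nn


def should_use_external_mm_preprocess(multimodal_model: nn.Module) -> bool:
    # Opt in by capability rather than by class name: a model either sets an
    # explicit flag or implements the post_process() hook the new flow relies on.
    return getattr(multimodal_model, "supports_external_mm_preprocess", False) or hasattr(
        multimodal_model, "post_process"
    )
```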
```python
    data_embedding_funcs: Dict[
        Modality, Callable[[List[MultimodalDataItem]], torch.Tensor]
    ] = None,
) -> torch.Tensor:
```
```python
            num_tokens = len(forward_batch.input_ids)

        if self.use_deepstack:
            self.input_deepstack_embeds.zero_()  # may be removed.
```
The comment "may be removed." is unclear. If zeroing out self.input_deepstack_embeds is necessary to prevent stale data in the CUDA graph, the comment should state that for clarity. If it's truly optional, please remove the line and the comment to avoid confusion.
```suggestion
            self.input_deepstack_embeds.zero_()  # Clear stale data for CUDA graph correctness.
```
```python
    ) -> torch.Tensor:
        if not self.use_deepstack:
            return embeddings, forward_batch
        deepstack_embeddings = []
```
Motivation
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist
Summary by Sourcery
Add support for piecewise CUDA graph execution for multimodal Qwen2.5-VL (and related QwenVL models) by externalizing multimodal preprocessing and extending graph runners to handle embeddings directly.
New Features:
Enhancements:
Tests: