[core] Adopt graph rewriter on fx.graph to enable automatic kernel fusion #2389
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces a graph rewriter for fx.graph to enable automatic kernel fusion on Ascend hardware. The changes include a new compiler interface, a graph rewrite pass manager, a specific fusion pass for AddRMSNorm and quantization, and corresponding patches and tests. While the overall approach is sound, I've identified several critical issues related to correctness in the fusion logic, configuration, and testing. Specifically, there are errors in handling operator return values, incorrect configuration in the patch files, and incomplete tests that hide bugs. These issues must be addressed to ensure the feature works correctly.
```python
def make_compiler(compilation_config: CompilationConfig) -> CompilerInterface:
    return AscendAdaptor
```
The `make_compiler` function should return an instance of the compiler class, not the class itself. The caller expects an object that implements the `CompilerInterface`.
```diff
 def make_compiler(compilation_config: CompilationConfig) -> CompilerInterface:
-    return AscendAdaptor
+    return AscendAdaptor()
```
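To illustrate, here is a minimal, self-contained sketch of why the instance matters; `CompilerInterface` is a simplified stand-in here, and its `compile` signature is an assumption rather than vLLM's actual API.

```python
# Sketch only: CompilerInterface is a simplified stand-in, not vLLM's class.
class CompilerInterface:
    def compile(self, graph):
        raise NotImplementedError

class AscendAdaptor(CompilerInterface):
    def compile(self, graph):
        return graph

def make_compiler(compilation_config) -> CompilerInterface:
    # Callers use the return value as an object (compiler.compile(graph)),
    # so returning the class itself would break them; return an instance.
    return AscendAdaptor()

compiler = make_compiler(compilation_config=None)
assert isinstance(compiler, CompilerInterface)
```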
PASS_KEY = "graph_rewriter_pass" | ||
inductor_config[PASS_KEY] = self.post_grad_pass_manager |
There are two issues here that will cause the compilation to fail at runtime:

- The key used to store the pass manager in `inductor_config` is `"graph_rewriter_pass"`. However, `AscendAdaptor` expects the key to be `"graph_rewriter_manager"`.
- The value assigned is `self.post_grad_pass_manager`, which is the original vLLM Inductor pass manager. It should be the newly created `self.graph_rewriter_pass_manager`.

This will lead to a `KeyError` and the use of the wrong pass manager.
PASS_KEY = "graph_rewriter_pass" | |
inductor_config[PASS_KEY] = self.post_grad_pass_manager | |
PASS_KEY = "graph_rewriter_manager" | |
inductor_config[PASS_KEY] = self.graph_rewriter_pass_manager |
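For clarity, a hedged sketch of the key handshake described above; only the key names come from this review, and the pass-manager class is a stand-in rather than the PR's real one.

```python
# Sketch only: DummyGraphRewriterPassManager stands in for the PR's class.
PASS_KEY = "graph_rewriter_manager"

class DummyGraphRewriterPassManager:
    def __call__(self, graph):
        return graph

inductor_config = {}
# Producer side: store the graph rewriter pass manager (not vLLM's Inductor
# pass manager) under the key the consumer expects.
inductor_config[PASS_KEY] = DummyGraphRewriterPassManager()

# Consumer side (AscendAdaptor in the PR): a mismatched key such as
# "graph_rewriter_pass" would raise KeyError here.
pass_manager = inductor_config[PASS_KEY]
```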
```python
    offset,
    epsilon=1e-6)
quantized_output = output[0]
residual = output[2]
```
The `replace` function incorrectly uses `output[2]` as the residual. Assuming `torch.ops.npu.npu_add_rms_norm_quant` has a similar return signature to `npu_add_rms_norm`, the residual should be `output[1]`. Using `output[2]` will result in a functionally incorrect fused operation.
```diff
-residual = output[2]
+residual = output[1]
```
```python
self.weight = nn.Parameter(torch.Tensor(hidden_size))
self.bias = nn.Parameter(torch.Tensor(hidden_size))
```
The `weight` and `bias` tensors are created with `torch.Tensor()`, which leaves them with uninitialized data. This can lead to non-deterministic behavior and flaky tests. It's important to initialize these parameters to ensure test reproducibility.
```diff
-self.weight = nn.Parameter(torch.Tensor(hidden_size))
-self.bias = nn.Parameter(torch.Tensor(hidden_size))
+self.weight = nn.Parameter(torch.ones(hidden_size))
+self.bias = nn.Parameter(torch.zeros(hidden_size))
```
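A small, hedged demonstration of the reproducibility point; this is illustrative only and not part of the PR's test.

```python
import torch
import torch.nn as nn

hidden_size = 4
# torch.Tensor(n) allocates memory without initializing it, so the values
# are arbitrary and can differ between runs.
uninitialized = nn.Parameter(torch.Tensor(hidden_size))
# Deterministic initialization keeps the test reproducible.
weight = nn.Parameter(torch.ones(hidden_size))
bias = nn.Parameter(torch.zeros(hidden_size))
print(uninitialized.data)
print(weight.data, bias.data)
```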
```python
from vllm.logger import init_logger
from vllm.compilation.vllm_inductor_pass import VllmInductorPass
from vllm.compilation.inductor_pass import get_pass_context, InductorPass
from quant_fusion_pass import AscendQuantFusionPass
```
The import `from quant_fusion_pass import AscendQuantFusionPass` is an implicit relative import. This is not recommended, as it is fragile and can fail depending on the execution context. It should be an explicit relative import.
```diff
-from quant_fusion_pass import AscendQuantFusionPass
+from .quant_fusion_pass import AscendQuantFusionPass
```
This pull request has conflicts, please resolve those before we can evaluate the pull request.
👍🏻👍🏻👍🏻 This is a super smart idea – using compiler tricks to make model code way simpler.
vllm_ascend/ascend_config.py (outdated)
```python
    Configuration Object for ascend_compilation_config from additional_config
    """

    def __init__(self, ascend_compilation_config: dict):
```
Suggest explicitly naming the options as args (e.g. `enable_graph_rewriter = True`, ...), and if you want things to stay extensible, I guess `**kwargs` is more pythonic.
Good suggestion, I'll change this
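A hedged sketch of what that could look like; the option names follow the docs table in this PR, but the constructor shape is an assumption, not the final implementation.

```python
class AscendCompilationConfig:
    """Sketch only: explicit keyword args for known options, **kwargs kept
    for forward compatibility."""

    def __init__(self,
                 enable_graph_rewrite: bool = True,
                 enable_quantization_fusion: bool = True,
                 **kwargs):
        self.enable_graph_rewrite = enable_graph_rewrite
        self.enable_quantization_fusion = enable_quantization_fusion
        # Keep unrecognized options around so new flags don't break old configs.
        self.extra_options = kwargs

# Usage with an additional_config-style dict:
config = AscendCompilationConfig(**{"enable_graph_rewrite": False})
```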
**ascend_compilation_config**

| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| `enable_graph_rewrite` | bool | `True` | Whether to enable the graph rewriter to rewrite the fx graph generated by torch.compile |
Shall we highlight that this option is a primary flag that can turn off all compilation and cause the other compiler options to be ignored?
Got it, I'll emphasize this one.
| Name | Type | Default | Description |
| ---- | ---- | ------- | ----------- |
| `enable_graph_rewrite` | bool | `True` | Whether to enable the graph rewriter to rewrite the fx graph generated by torch.compile |
| `enable_quantization_fusion` | bool | `True` | Whether to enable the fusion pass on op + quantize; this should remain enabled by default so all users benefit from the performance boost |
Not sure if we really want to expose this as an official configuration flag here. I feel this complicates the UI for normal users. Some considerations:

- What is the granularity of such configurations? For example, is it better to make each fusion configurable, e.g., naming it `enable_rmsnorm_quant_fusion`?
- Shall we expose this to normal users? Or is it better to make it a private option first, e.g., `_enable_quantization_fusion`, so that it is a private, debugging-only flag?
This mainly follows vLLM's official code. It's true that the granularity here is a bit unclear in this PR, but this flag is more of a safety switch: if anything goes wrong, we can quickly guide customers to bypass the issue.
```python
# Related PR (if no, explain why):
# - We might add a PR to make vllm support a custom compiler interface, but it's not certain yet.
# Future Plan:
# We might push the customized compiler interface to the vllm main repo, and leave the backend selection to the platform itself.
```
Do I understand correctly that vLLM hard-codes "inductor" as the compiler backend for piece-wise graphs? Is there a way to plug in a custom compiler backend instead of "inductor"?
No, vLLM actually writes its own backend called 'VllmBackend', and inside of that vLLM runs its own graph-break and pattern-registration routine to ensure optimization and compatibility with other repos and packages.
```python
    rms_norm_input,
    residual,
    rms_norm_weight,
    1. / scale,
```
I guess this is worth a comment? :)
Sure
```python
super().__init__(vllm_config)
self.patterns = []
# Register the AddRMSNormQuant fusion pattern into the graph rewriter pattern list
AddRMSNormQuantPattern(vllm_config).register(self.patterns)
```
Perhaps it is better to use a decorator to register new patterns, following the open/closed principle...
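A hedged sketch of that decorator-based registration; all names here are illustrative, not the PR's actual API.

```python
PATTERN_REGISTRY = []

def register_pattern(cls):
    # Collect pattern classes at import time, so adding a new fusion pattern
    # does not require editing the pass manager (open/closed principle).
    PATTERN_REGISTRY.append(cls)
    return cls

@register_pattern
class AddRMSNormQuantPattern:
    def __init__(self, vllm_config):
        self.vllm_config = vllm_config

    def register(self, patterns):
        patterns.append(self)

# In the pass manager, instead of hard-coding each pattern:
patterns = []
for pattern_cls in PATTERN_REGISTRY:
    pattern_cls(vllm_config=None).register(patterns)
```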
This pull request has conflicts, please resolve those before we can evaluate the pull request.
What this PR does / why we need it?
The main goal of this PR is to alleviate the high maintenance burden from model duplication when we do model optimization. Some of our optimized models diverge a little from vLLM's modeling but need to rewrite several parts of the original one, which brings a non-negligible maintenance burden to vllm-ascend.
In order to solve that, we propose to leverage `torch.compile` and automatically fuse the patterns we want to merge. For more details, refer to the RFC #2386.

Does this PR introduce any user-facing change?
Yes, we add a new `additional_config` section, `ascend_compilation_config`.
How was this patch tested?