[Performance] Eliminate unnecessary H2D copies in FlashInfer decode #21854

Open · wants to merge 5 commits into main from feature/flashinfer_fast_decode_plan

Conversation

@MatthewBonanni (Contributor) commented on Jul 29, 2025:

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

In the FlashInfer backend, BatchDecodeWithPagedKVCacheWrapper::plan() copies tensors to the device, which is unnecessary when device-resident copies already exist. This PR refactors the decode planning path to eliminate these copies.
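
As a concrete illustration of the change, here is a minimal sketch (the helper name and argument list are illustrative, not the actual vLLM code) of building the planning inputs directly on the GPU from tensors that already live there, so that plan() no longer needs to stage them on the host and copy them over:

    import torch

    def build_decode_plan_inputs(block_table_bounds: torch.Tensor,
                                 seq_lens: torch.Tensor,
                                 page_size: int):
        # Assumes block_table_bounds and seq_lens are already device tensors.
        device = block_table_bounds.device

        # Prefix sum of per-request block counts -> paged KV indptr,
        # constructed directly on the device (no CPU staging, no H2D copy).
        paged_kv_indptr = torch.zeros(block_table_bounds.numel() + 1,
                                      dtype=torch.int32, device=device)
        paged_kv_indptr[1:] = block_table_bounds.cumsum(dim=0,
                                                        dtype=torch.int32)

        # Length of the last (possibly partial) page of each sequence,
        # in the range [1, page_size].
        paged_kv_last_page_len = (seq_lens % -page_size) + page_size

        return paged_kv_indptr, paged_kv_last_page_len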

Test Plan

Correctness:

pytest tests/v1/attention/test_attention_backends.py

Performance:

vllm bench throughput \
        --model "NousResearch/Hermes-3-Llama-3.1-8B" \
        --dataset-name "random" \
        --input-len 128 \
        --output-len 512 \
        --num-prompts 100

Test Result

Correctness tests pass. FlashInferMetadataBuilder::_plan() duration is reduced from 150 µs to 100 µs.

(Optional) Documentation Update

@mergify (bot) added the v1 label on Jul 29, 2025.
@gemini-code-assist (bot) left a comment:

Code Review

This pull request optimizes the FlashInfer backend by eliminating unnecessary host-to-device memory copies, leading to a performance improvement. The approach of pre-creating tensors on the device and passing them down is sound.

The main concern is the introduction of the fast_decode_plan function, which duplicates code from the flashinfer library, posing a future maintenance risk. I've provided a detailed comment with suggestions to improve the maintainability of this new function.

Comment on lines 719 to 720:

    def fast_decode_plan(
        self,

Severity: high

This new function fast_decode_plan appears to be a modified copy of flashinfer.decode.BatchDecodeWithPagedKVCacheWrapper.plan. While this is effective for the performance gain, it introduces a maintenance risk by duplicating code from an external library. If the upstream flashinfer library updates its plan method, this code may become outdated.

The ideal long-term solution would be to contribute this change upstream to flashinfer.

For the current implementation, I have a few suggestions to improve clarity and maintainability:

  1. Function Signature: The first argument is named self, which is idiomatic for instance methods but confusing for a standalone function. Renaming it to wrapper would clarify that it expects an instance of BatchDecodeWithPagedKVCacheWrapper.
  2. Add Explanatory Comment: A comment at the beginning of the function explaining why it's a copy of the upstream code and what changes were made would be very helpful for future maintainers. For example:
    # This function is a modified version of flashinfer's
    # BatchDecodeWithPagedKVCacheWrapper.plan method. It's been refactored
    # to accept pre-allocated device tensors (indptr, last_page_len) to
    # avoid unnecessary H2D copies within the planning phase.
    # TODO(author): Consider upstreaming this change to flashinfer.
  3. Visibility: Since this function is only used within this module, consider renaming it to _fast_decode_plan to indicate it's a private helper function.
Suggested change:

    -def fast_decode_plan(
    -    self,
    +def _fast_decode_plan(
    +    wrapper,
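
A minimal sketch of what the suggested private helper could look like; everything beyond the rename and the explanatory docstring is illustrative, and *plan_args/**plan_kwargs stand in for flashinfer's real plan() parameters, which are not reproduced here:

    def _fast_decode_plan(wrapper, indptr, last_page_len,
                          *plan_args, **plan_kwargs):
        """Modified copy of flashinfer's
        BatchDecodeWithPagedKVCacheWrapper.plan, refactored to accept
        pre-allocated device tensors (indptr, last_page_len) so that the
        planning phase performs no H2D copies.

        TODO: consider upstreaming this change to flashinfer.
        """
        # Body omitted: it mirrors the upstream plan() implementation,
        # minus the host-to-device copies of indptr and last_page_len.
        ...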


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@MatthewBonanni force-pushed the feature/flashinfer_fast_decode_plan branch from 5317449 to 9fcd3d2 on Jul 29, 2025 at 17:46.
@MatthewBonanni changed the title from "[Performance] Eliminate unnecessary H2D copies in FlashInfer backend" to "[Performance] Eliminate unnecessary H2D copies in FlashInfer decode" on Jul 29, 2025.
Comment on lines 449 to 466:

    paged_kv_indptr_cpu[1:] = block_table_bounds_cpu.cumsum(
        dim=0, dtype=torch.int32)

    paged_kv_indptr = torch.zeros(len(block_table_bounds) + 1,
                                  dtype=torch.int32,
                                  device=self.device)
    paged_kv_indptr[1:] = block_table_bounds.cumsum(dim=0,
                                                    dtype=torch.int32)

    paged_kv_last_page_len_cpu = seq_lens_cpu % page_size
    paged_kv_last_page_len_cpu = torch.where(
        paged_kv_last_page_len_cpu == 0, page_size,
        paged_kv_last_page_len_cpu)

    paged_kv_last_page_len = seq_lens % page_size
    paged_kv_last_page_len = torch.where(paged_kv_last_page_len == 0,
                                         page_size, paged_kv_last_page_len)

@MatthewBonanni (author) commented:

@LucasWilkinson am I correct that the device versions of these tensors must be computed as I've done here? Or are they available elsewhere? I'm showing that the cost of this compute negates the benefit of eliminating the copies

Collaborator commented:

Yeah, we would need to compute these on device; however, that should be non-blocking on the CPU side (the bottleneck). Are you seeing a lot of CPU overhead in the trace?

Collaborator commented:

I see the torch.where is taking a long time:

[profiler trace image]

We might be able to play some tricks here, like doing (I think this is correct?!):

    paged_kv_last_page_len = (seq_lens % -page_size) + page_size

instead of

    paged_kv_last_page_len = seq_lens % page_size
    paged_kv_last_page_len = torch.where(paged_kv_last_page_len == 0,
                                         page_size, paged_kv_last_page_len)
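
For reference, a quick self-contained check (with toy values assumed for seq_lens and page_size) that the one-liner matches the torch.where version; it relies on PyTorch's % following Python's floored-modulo semantics, where the result takes the sign of the divisor:

    import torch

    page_size = 16
    seq_lens = torch.tensor([1, 15, 16, 17, 32, 33], dtype=torch.int32)

    # Original two-step version: map exact multiples of page_size back to
    # a full page.
    via_where = seq_lens % page_size
    via_where = torch.where(via_where == 0, page_size, via_where)

    # One-step version: seq_lens % -page_size lies in (-page_size, 0],
    # so adding page_size yields values in (0, page_size].
    via_trick = (seq_lens % -page_size) + page_size

    assert torch.equal(via_where, via_trick)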

@MatthewBonanni (author) commented:

Done, this and your other suggestion about cumsum() helped!

Collaborator commented:

Do you have any new numbers?
