[RFC] [PyTorch Flow] Re-implement LlmRequest and Scheduler in pure Python #3034

Open
QiJune opened this issue Mar 24, 2025 · 2 comments

QiJune commented Mar 24, 2025

Motivation

Currently, we have a pure Python PyExecutor class that handles the main event loop. It provides good flexibility, letting us support features like the overlap scheduler and attention data parallelism quickly.

Inside it, we still use many pybind classes, including LlmRequest, KVCacheManager, and Scheduler.

To improve flexibility further, we want to migrate more components from C++ to Python.

Analysis

LlmRequest, KVCacheManager, and Scheduler are tightly coupled. LlmRequest maintains many state tensors, including output_tokens, chunk_size, state, etc. Both KVCacheManager and Scheduler read and write members of LlmRequest internally.

We tried to implement a pure Python CapacityScheduler before, but it introduced too many pybind calls on LlmRequest. We observed that pybind calls are about 2x-3x slower than pure Python calls, so we did not enable this pure Python CapacityScheduler due to its large host overhead.

Considering the complexity of KVCacheManager, we have decided to re-implement LlmRequest and Scheduler in pure Python as the first step. At the same time, we will remove LlmRequest from the KVCacheManager interface.

Proposed Solution

  1. Introduce a new flag enable_pure_python_scheduler in PyTorchConfig to enable the pure Python based scheduler

Since it will take some time to migrate all the components and do performance tuning, the pure Python based scheduler will be hidden from users at the beginning.
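As a rough sketch of how this flag could be surfaced (assuming PyTorchConfig stays a plain dataclass; the surrounding fields are omitted and the default value is an assumption based on the "hidden from users" goal):

from dataclasses import dataclass

@dataclass
class PyTorchConfig:
    # ... existing PyTorch-flow options ...
    # Off by default, so the pure Python scheduler stays hidden from users
    # until the migration and performance tuning are finished.
    enable_pure_python_scheduler: bool = False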

  2. Refactor LlmRequest to support maintaining all state tensors on the Python side

All state tensors will first be "duplicated" on the Python side. The member functions of LlmRequest will be dispatched to different paths depending on the enable_pure_python_scheduler flag.

import tensorrt_llm.bindings


class LlmRequest(tensorrt_llm.bindings.internal.batch_manager.LlmRequest):
    def __init__(self, *args, enable_pure_python_scheduler=False, **kwargs):
        super().__init__(*args, **kwargs)
        self.enable_pure_python_scheduler = enable_pure_python_scheduler
        self.py_request_id = self.request_id
        self.py_state = self.state
        self.py_tokens = [[] for _ in range(self.sampling_config.beam_width)]

    def get_tokens(self, beam_idx: int):
        if self.enable_pure_python_scheduler:
            # dispatch to pure Python path
            return self.py_tokens[beam_idx]
        else:
            # dispatch to pybind path
            return super().get_tokens(beam_idx)
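The same dispatch pattern would also cover the write paths. A minimal sketch of a token setter inside the same LlmRequest subclass (add_new_token is assumed here to be the name of the pybind setter; the same pattern applies to whichever setter the binding actually exposes):

    def add_new_token(self, token: int, beam_idx: int):
        if self.enable_pure_python_scheduler:
            # pure Python path: update only the Python-side copy,
            # leaving the C++ token storage untouched
            self.py_tokens[beam_idx].append(token)
        else:
            # pybind path: forward to the C++ implementation
            super().add_new_token(token, beam_idx)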
  3. Implement a pure Python based Scheduler

  4. Decouple LlmRequest from the KVCacheManager interface
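For item 4, one possible shape of the decoupled interface is sketched below; the method names and signatures are hypothetical and only illustrate that callers would pass plain values instead of LlmRequest objects:

class KVCacheManager:
    # Hypothetical sketch: callers pass plain identifiers and token counts,
    # so the cache manager never reads or writes LlmRequest members.
    def add_sequence(self, request_id: int, prompt_len: int, beam_width: int) -> None:
        ...

    def get_needed_blocks_to_completion(self, prompt_len: int, max_new_tokens: int) -> int:
        ...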

Future Works

We need some time to do performance tuning. After that, we will evaluate the possibility of enabling the pure Python based Scheduler by default.

@QiJune QiJune closed this as completed Mar 25, 2025
@QiJune QiJune reopened this Mar 25, 2025
@QiJune QiJune closed this as completed Mar 25, 2025
@QiJune QiJune changed the title [RFC] Migrate more components of PyExecutor from C++ to Python [RFC] [PyTorch Flow] Re-implement LlmRequest and Scheduler in pure Python side Mar 25, 2025
@QiJune QiJune changed the title [RFC] [PyTorch Flow] Re-implement LlmRequest and Scheduler in pure Python side [RFC] [PyTorch Flow] Re-implement LlmRequest and Scheduler in pure Python Mar 25, 2025
@QiJune QiJune reopened this Mar 25, 2025
@QiJune QiJune added the RFC label Mar 25, 2025
@BrechtCorbeel

Sounds like a solid plan. Moving more components to Python should definitely make things more flexible, and introducing the enable_pure_python_scheduler flag seems like a smart way to manage the transition without impacting users too soon.

One thing I’m curious about is the performance trade-off. Since the previous pure Python CapacityScheduler ran into issues with pybind overhead, are there any benchmarks planned to compare the new implementation? Also, with state tensors being duplicated on the Python side, is there any concern about increased memory usage?

Decoupling LlmRequest from KVCacheManager makes sense, but I imagine integrating the new setup with KVCacheManager later could come with its own challenges. Would be interesting to see how that plays out.


QiJune commented Mar 31, 2025

Hi @BrechtCorbeel, thanks for your comments.

Since the previous pure Python CapacityScheduler ran into issues with pybind overhead, are there any benchmarks planned to compare the new implementation?

In the previous implementation, only the CapacityScheduler was in pure Python; LlmRequest was still a pybind class. Thus, there were many pybind calls inside the scheduling loop:

for request in active_requests:
    req_state = request.state
    # if request cannot be scheduled yet or request should no longer be scheduled, skip
    if req_state.value < self.no_schedule_until_state.value or req_state.value >= self.no_schedule_after_state.value:
        continue
    if len(scheduled_requests) >= self.max_num_requests or reserved_blocks >= max_blocks:
        break
    elif req_state == LlmRequestState.GENERATION_IN_PROGRESS or req_state == LlmRequestState.GENERATION_TO_COMPLETE:
        scheduled_requests.append(request)
        reserved_blocks += self.kv_cache_manager.get_needed_resource_to_completion(request)
    else:
        pending_requests.append(request)

In the proposed new implementation, both the Scheduler and LlmRequest will be in pure Python, so we expect the number of pybind calls to be reduced greatly. We plan to run some benchmarks on Llama 8B (single GPU) and 70B (multi-GPU).
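As an illustration of the kind of per-call host-overhead measurement involved (this is not the planned benchmark; the helper below assumes a list of already-constructed LlmRequest wrappers like the subclass sketched in the issue body):

import time

import tensorrt_llm.bindings

# Pybind base class, referenced the same way as in the LlmRequest subclass above.
CppLlmRequest = tensorrt_llm.bindings.internal.batch_manager.LlmRequest

def measure_get_tokens_overhead(requests, iters=1000):
    # Time the pybind getter against the Python-side copy on the same requests.
    start = time.perf_counter()
    for _ in range(iters):
        for req in requests:
            CppLlmRequest.get_tokens(req, 0)  # pybind path
    pybind_seconds = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(iters):
        for req in requests:
            req.py_tokens[0]  # pure Python path
    python_seconds = time.perf_counter() - start
    return pybind_seconds, python_seconds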

Also, with state tensors being duplicated on the Python side, is there any concern about increased memory usage?

The duplicated state tensors exist to support the pure Python path and only occupy memory when enable_pure_python_scheduler is True. At the same time, the state tensors on the C++ side remain empty, so there should be no additional memory usage.
