[RFC] [PyTorch Flow] Re-implement LlmRequest and Scheduler in pure Python #3034

Open
QiJune opened this issue Mar 24, 2025 · 2 comments

QiJune commented Mar 24, 2025

Motivation

Currently, we have a pure Python PyExecutor class that handles the main event loop. It provides good flexibility, letting us support features like the overlap scheduler and attention data parallelism quickly.

Inside it, we still use many pybind classes, including LlmRequest, KVCacheManager, and Scheduler.

To improve flexibility further, we want to migrate more components from C++ to Python.

Analysis

LlmRequest, KVCacheManager, and Scheduler are tightly coupled. LlmRequest maintains many state tensors, including output_tokens, chunk_size, state, etc. Both KVCacheManager and Scheduler read and write members of LlmRequest internally.

We tried to implement a pure Python CapacityScheduler before, but it introduced too many pybind calls on LlmRequest. We observed that pybind calls are about 2x-3x slower than pure Python calls, so we did not enable this pure Python CapacityScheduler due to its large host overhead.

Considering the complexity of KVCacheManager, we have decided to re-implement LlmRequest and Scheduler in pure Python as the first step. At the same time, we will remove LlmRequest from the KVCacheManager interface.

Proposed Solution

  1. Introduce a new flag enable_pure_python_scheduler in PyTorchConfig to enable the pure Python based scheduler

Since it will take some time to migrate all the components and do performance tuning, the pure Python based scheduler will be hidden from users at the beginning.
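As a rough sketch of how this flag could be surfaced (assuming PyTorchConfig stays a plain dataclass; the surrounding fields are omitted and the default value is an assumption based on the "hidden from users" goal):

from dataclasses import dataclass

@dataclass
class PyTorchConfig:
    # ... existing PyTorch-flow options ...
    # Off by default, so the pure Python scheduler stays hidden from users
    # until the migration and performance tuning are finished.
    enable_pure_python_scheduler: bool = False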

  2. Refactor LlmRequest to support maintaining all state tensors on the Python side

All state tensors will first be "duplicated" on the Python side. The member functions of LlmRequest will be dispatched to different paths depending on the enable_pure_python_scheduler flag.

import tensorrt_llm.bindings


class LlmRequest(tensorrt_llm.bindings.internal.batch_manager.LlmRequest):
    def __init__(self, *args, enable_pure_python_scheduler=False, **kwargs):
        super().__init__(*args, **kwargs)
        self.enable_pure_python_scheduler = enable_pure_python_scheduler
        self.py_request_id = self.request_id
        self.py_state = self.state
        self.py_tokens = [[] for _ in range(self.sampling_config.beam_width)]

    def get_tokens(self, beam_idx: int):
        if self.enable_pure_python_scheduler:
            # dispatch to pure Python path
            return self.py_tokens[beam_idx]
        else:
            # dispatch to pybind path
            return super().get_tokens(beam_idx)
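The same dispatch pattern would also cover the write paths. A minimal sketch of a token setter inside the same LlmRequest subclass (add_new_token is assumed here to be the name of the pybind setter; the same pattern applies to whichever setter the binding actually exposes):

    def add_new_token(self, token: int, beam_idx: int):
        if self.enable_pure_python_scheduler:
            # pure Python path: update only the Python-side copy,
            # leaving the C++ token storage untouched
            self.py_tokens[beam_idx].append(token)
        else:
            # pybind path: forward to the C++ implementation
            super().add_new_token(token, beam_idx)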
  3. Implement a pure Python based Scheduler

  4. Decouple LlmRequest from the KVCacheManager interface
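For item 4, one possible shape of the decoupled interface is sketched below; the method names and signatures are hypothetical and only illustrate that callers would pass plain values instead of LlmRequest objects:

class KVCacheManager:
    # Hypothetical sketch: callers pass plain identifiers and token counts,
    # so the cache manager never reads or writes LlmRequest members.
    def add_sequence(self, request_id: int, prompt_len: int, beam_width: int) -> None:
        ...

    def get_needed_blocks_to_completion(self, prompt_len: int, max_new_tokens: int) -> int:
        ...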

Future Works

We need some time to do performance tuning. After that, we will evaluate the possibility of enabling the pure Python based Scheduler by default.

@QiJune QiJune closed this as completed Mar 25, 2025
@QiJune QiJune reopened this Mar 25, 2025
@QiJune QiJune closed this as completed Mar 25, 2025
@QiJune QiJune changed the title [RFC] Migrate more components of PyExecutor from C++ to Python [RFC] [PyTorch Flow] Re-implement LlmRequest and Scheduler in pure Python side Mar 25, 2025
@QiJune QiJune changed the title [RFC] [PyTorch Flow] Re-implement LlmRequest and Scheduler in pure Python side [RFC] [PyTorch Flow] Re-implement LlmRequest and Scheduler in pure Python Mar 25, 2025
@QiJune QiJune reopened this Mar 25, 2025
@QiJune QiJune added the RFC label Mar 25, 2025
@BrechtCorbeel

Sounds like a solid plan. Moving more components to Python should definitely make things more flexible, and introducing the enable_pure_python_scheduler flag seems like a smart way to manage the transition without impacting users too soon.

One thing I’m curious about is the performance trade-off. Since the previous pure Python CapacityScheduler ran into issues with pybind overhead, are there any benchmarks planned to compare the new implementation? Also, with state tensors being duplicated on the Python side, is there any concern about increased memory usage?

Decoupling LlmRequest from KVCacheManager makes sense, but I imagine integrating the new setup with KVCacheManager later could come with its own challenges. Would be interesting to see how that plays out.


QiJune commented Mar 31, 2025

Hi @BrechtCorbeel, thanks for your comments.

Since the previous pure Python CapacityScheduler ran into issues with pybind overhead, are there any benchmarks planned to compare the new implementation?

In the previous implementation, only the CapacityScheduler was in pure Python; LlmRequest was still a pybind class. Thus, there were many pybind calls inside the scheduling loop:

for request in active_requests:
    req_state = request.state
    # if request cannot be scheduled yet or request should no longer be scheduled, skip
    if req_state.value < self.no_schedule_until_state.value or req_state.value >= self.no_schedule_after_state.value:
        continue
    if len(scheduled_requests) >= self.max_num_requests or reserved_blocks >= max_blocks:
        break
    elif req_state == LlmRequestState.GENERATION_IN_PROGRESS or req_state == LlmRequestState.GENERATION_TO_COMPLETE:
        scheduled_requests.append(request)
        reserved_blocks += self.kv_cache_manager.get_needed_resource_to_completion(request)
    else:
        pending_requests.append(request)

In the proposed new implementation, both the Scheduler and LlmRequest will be in pure Python, so we expect the number of pybind calls to be reduced greatly. We plan to run some benchmarks on Llama 8B (single GPU) and 70B (multi-GPU).
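As an illustration of the kind of per-call host-overhead measurement involved (this is not the planned benchmark; the helper below assumes a list of already-constructed LlmRequest wrappers like the subclass sketched in the issue body):

import time

import tensorrt_llm.bindings

# Pybind base class, referenced the same way as in the LlmRequest subclass above.
CppLlmRequest = tensorrt_llm.bindings.internal.batch_manager.LlmRequest

def measure_get_tokens_overhead(requests, iters=1000):
    # Time the pybind getter against the Python-side copy on the same requests.
    start = time.perf_counter()
    for _ in range(iters):
        for req in requests:
            CppLlmRequest.get_tokens(req, 0)  # pybind path
    pybind_seconds = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(iters):
        for req in requests:
            req.py_tokens[0]  # pure Python path
    python_seconds = time.perf_counter() - start
    return pybind_seconds, python_seconds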

Also, with state tensors being duplicated on the Python side, is there any concern about increased memory usage?

The duplicated state tensors exist to support the pure Python path and only occupy memory when enable_pure_python_scheduler is True. At the same time, the state tensors on the C++ side remain empty, so there should be no additional memory usage.
