[RFC] [PyTorch Flow] Re-implement LlmRequest and Scheduler in pure Python #3034
Comments
Sounds like a solid plan. Moving more components to Python should definitely make things more flexible, and introducing the enable_pure_python_scheduler flag seems like a smart way to manage the transition without impacting users too soon. One thing I’m curious about is the performance trade-off. Since the previous pure Python CapacityScheduler ran into issues with pybind overhead, are there any benchmarks planned to compare the new implementation? Also, with state tensors being duplicated on the Python side, is there any concern about increased memory usage? Decoupling LlmRequest from KVCacheManager makes sense, but I imagine integrating the new setup with KVCacheManager later could come with its own challenges. Would be interesting to see how that plays out.
Hi @BrechtCorbeel, thanks for your comments.
In the previous implementation, only the CapacityScheduler was in pure Python; LlmRequest was still a pybind class. Thus, there were many pybind calls inside the scheduling loop (see TensorRT-LLM/tensorrt_llm/_torch/pyexecutor/scheduler.py, lines 111 to 125 at 794f61c).
In the proposed new implementation, Scheduler and LlmRequest will both be in pure Python, so we expect pybind calls to be reduced greatly. We plan to run some benchmarks on Llama 8B (single GPU) and 70B (multi-GPU).
The duplicated state tensors are needed to support the pure Python path; they will occupy extra memory when …
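The 2x-3x pybind-vs-pure-Python gap mentioned in this thread could be measured with a small timeit harness. The sketch below is illustrative only: the pybind-backed LlmRequest is not importable here, so a property-based wrapper stands in for the binding layer's per-access indirection, and all class and function names are invented.

```python
import timeit

class PurePythonRequest:
    """Stand-in for the proposed pure-Python request: plain attributes."""
    __slots__ = ("state", "chunk_size")

    def __init__(self):
        self.state = 0
        self.chunk_size = 64

class WrappedRequest:
    """Simulates a binding layer by routing attribute access through getters.
    This does NOT measure real pybind11 overhead; it only shows the shape
    of a benchmark that could quantify the gap discussed above."""

    def __init__(self):
        self._state = 0
        self._chunk_size = 64

    @property
    def state(self):
        return self._state

    @property
    def chunk_size(self):
        return self._chunk_size

def read_fields(req, n=100_000):
    # Tight loop of attribute reads, mimicking a scheduler inspecting
    # request state on every iteration.
    total = 0
    for _ in range(n):
        total += req.state + req.chunk_size
    return total

pure_t = timeit.timeit(lambda: read_fields(PurePythonRequest()), number=10)
wrapped_t = timeit.timeit(lambda: read_fields(WrappedRequest()), number=10)
print(f"pure: {pure_t:.3f}s, wrapped: {wrapped_t:.3f}s")
```

A real benchmark would substitute the actual pybind LlmRequest for WrappedRequest and run representative scheduling loops rather than raw attribute reads.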
Motivation
Currently, we have a pure Python based PyExecutor class, which handles the main event loop. It provides good flexibility, allowing features like the overlap scheduler and attention data parallelism to be supported quickly.
Inside it, we still use lots of pybind classes, including LlmRequest, KVCacheManager and Scheduler.
To improve flexibility further, we want to migrate more components from C++ to Python.
Analysis
LlmRequest, KVCacheManager and Scheduler are tightly coupled. LlmRequest maintains many state tensors, including output_tokens, chunk_size, state, etc. Both KVCacheManager and Scheduler read and write members of LlmRequest internally.
We tried to implement a pure Python CapacityScheduler before, but it introduced too many pybind calls into LlmRequest. We observed that pybind calls are about 2x-3x slower than pure Python calls, so we did not enable this pure Python CapacityScheduler due to its large host overhead.
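To make the coupling concrete, here is a minimal, hypothetical sketch of what a pure-Python request object with the state members listed above (output_tokens, chunk_size, state) might look like. Everything except those three field names is invented; the real LlmRequest has a much richer state machine and more members.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class RequestState(Enum):
    # Hypothetical states for illustration only.
    CONTEXT_INIT = 0
    GENERATION_IN_PROGRESS = 1
    COMPLETE = 2

@dataclass
class PyLlmRequest:
    """Hypothetical pure-Python request: all state lives on the Python side,
    so a Python scheduler reads/writes plain attributes with no pybind calls."""
    request_id: int
    prompt_len: int
    max_new_tokens: int
    chunk_size: int = 0
    state: RequestState = RequestState.CONTEXT_INIT
    output_tokens: List[int] = field(default_factory=list)

    def add_output_token(self, token: int) -> None:
        self.output_tokens.append(token)
        if len(self.output_tokens) >= self.max_new_tokens:
            self.state = RequestState.COMPLETE

req = PyLlmRequest(request_id=0, prompt_len=8, max_new_tokens=2)
req.state = RequestState.GENERATION_IN_PROGRESS
req.add_output_token(11)
req.add_output_token(12)
print(req.state.name)  # COMPLETE
```

With such an object, both the scheduler and the KV-cache layer touch only Python attributes, which is the property the re-implementation is after.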
Considering the complexity of KVCacheManager, we decided to re-implement LlmRequest and Scheduler in pure Python as the first step. At the same time, we will remove LlmRequest from the KVCacheManager interface.
Proposed Solution
1. Add enable_pure_python_scheduler in PyTorchConfig to enable the pure Python based scheduler. Considering it takes some time to migrate all the components and do performance tuning, the pure Python based scheduler will be hidden from users at the beginning.
2. Refactor LlmRequest to support maintaining all state tensors on the Python side. All state tensors will be "duplicated" on the Python side first. The member functions of LlmRequest will be dispatched to different paths depending on the enable_pure_python_scheduler flag.
3. Implement a pure Python based Scheduler.
4. Decouple LlmRequest from the KVCacheManager interface.
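The flag-based dispatch of LlmRequest member functions described above could follow a pattern like this sketch. Only enable_pure_python_scheduler, LlmRequest, and PyTorchConfig come from the RFC; the method names and the placeholder legacy path are invented.

```python
class LlmRequest:
    """Hypothetical facade: each member function dispatches either to the
    new pure-Python path or to the existing pybind-backed path, depending
    on the enable_pure_python_scheduler flag (from PyTorchConfig)."""

    def __init__(self, enable_pure_python_scheduler: bool):
        self._pure_python = enable_pure_python_scheduler
        self._py_output_tokens = []   # Python-side duplicated state
        self._cpp_request = None      # would hold the pybind object

    def add_output_token(self, token: int) -> None:
        if self._pure_python:
            # Pure-Python path: plain list append, no pybind call.
            self._py_output_tokens.append(token)
        else:
            # Legacy path: would forward to the pybind-backed request;
            # shown as a placeholder since bindings are unavailable here.
            raise NotImplementedError("pybind path not shown in this sketch")

    def num_output_tokens(self) -> int:
        if self._pure_python:
            return len(self._py_output_tokens)
        raise NotImplementedError("pybind path not shown in this sketch")

req = LlmRequest(enable_pure_python_scheduler=True)
req.add_output_token(42)
print(req.num_output_tokens())  # 1
```

This keeps one public interface while both backends coexist, so callers are unaffected when the flag flips the default later.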
Future Works
We need some time to do performance tuning. After that, we will evaluate the possibility of enabling the pure Python based Scheduler by default.