[Serve] Ray Serve with multiplexing with batching #58358
base: master
Conversation
Signed-off-by: manickavela29 <[email protected]>
Code Review
This pull request introduces batching capabilities to the multiplexing feature in Ray Serve, which is a valuable enhancement for performance. The changes include updates to the multiplexed decorator, the _ModelMultiplexWrapper, and the request router to support batching-aware routing. The implementation is mostly solid, but there are a few key issues to address. The batching-aware routing logic in the request router appears incomplete, as it identifies opportunities for batching but doesn't act on them. Additionally, there's a potential performance issue with how completed requests are cleaned up. The new tests are comprehensive, but one of them has flawed logic that doesn't correctly test the batching mechanism, and there are some leftover debugging statements that should be removed. Overall, this is a great feature addition, and with these fixes, it will be a strong contribution.
Signed-off-by: manickavela29 <[email protected]>
Force-pushed from 5ea4fdc to 550ee37
Signed-off-by: manickavela29 <[email protected]>
Force-pushed from 0f61147 to f014923
Signed-off-by: manickavela29 <[email protected]>
Force-pushed from f014923 to 8cdf240
```python
    )
except Exception:
    # Future might not have replica result, skip
    pass
```
Bug: Batching Optimization Fails due to Incorrect Future Check
The batching-aware routing logic is ineffective. The _get_pending_requests_for_model method filters out completed requests, but the subsequent batching-friendly replica selection logic incorrectly checks for pending_req.future.done(). This condition is always false, preventing the batching optimization from executing.
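A minimal sketch of the bug as described: the method and attribute names (`_get_pending_requests_for_model`, `pending_req.future`) follow the review text, but the surrounding router structure is assumed for illustration and is not the PR's actual code.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class PendingRequest:
    model_id: str
    future: asyncio.Future
    replica_id: str


class Router:
    def __init__(self):
        self._pending_requests = []

    def _get_pending_requests_for_model(self, model_id):
        # Completed requests are filtered out here, so every request
        # returned below has future.done() == False.
        return [
            req
            for req in self._pending_requests
            if req.model_id == model_id and not req.future.done()
        ]

    def _find_batching_friendly_replica(self, model_id):
        for req in self._get_pending_requests_for_model(model_id):
            # BUG: this is always False for the requests returned above,
            # so the batching-aware selection never runs. Inverting (or
            # simply dropping) the check restores the intended behavior.
            if req.future.done():
                return req.replica_id
        return None
```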
```python
    request_context=request_context,
)

batch_queue.queue.put(single_request)
```
Bug: Misuse Breaks Lazy Batch Queue Wrapper API
The _LazyBatchQueueWrapper is used incorrectly by directly accessing its internal _queue attribute and its queue attribute. This bypasses the wrapper's lazy initialization and violates its intended API, which can lead to AttributeError or other runtime errors when handling batched requests or during shutdown.
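A sketch of the lazy-initialization contract at issue. This mirrors the general pattern of a lazy queue wrapper, not Ray Serve's actual `_LazyBatchQueueWrapper` internals, which may differ.

```python
import asyncio


class LazyBatchQueueWrapper:
    """Defers queue construction until first access."""

    def __init__(self):
        self._queue = None  # internal; created lazily, never touched directly

    @property
    def queue(self):
        # All callers must go through this property so the underlying
        # queue is guaranteed to exist before it is used.
        if self._queue is None:
            self._queue = asyncio.Queue()
        return self._queue


wrapper = LazyBatchQueueWrapper()

# Incorrect (the pattern the review flags): reaching into internals.
# wrapper._queue.put_nowait("request")  # fails before first initialization

# Correct: the property lazily initializes the queue on demand.
wrapper.queue.put_nowait("request")
```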
Description
This PR introduces multiplexed model serving with batching support to Ray Serve, enabling multiple models to be hosted efficiently on a single replica with automatic batching. This significantly improves resource utilization and reduces deployment costs in multi-model scenarios.
Performance Improvements
- Asynchronous cleanup: non-blocking cleanup of completed requests
- Batching-aware routing: prioritizes replicas with pending requests for the same model (see the sketch below)
- Efficient request tracking: optimized data structures for tracking pending requests by model
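A minimal sketch of batching-aware routing and per-model request tracking as described in the list above; the class and method names here are illustrative, not the PR's actual router internals.

```python
from collections import defaultdict


class BatchingAwareRouter:
    def __init__(self, replica_ids):
        self._replica_ids = replica_ids
        # model_id -> replica_id -> count of pending requests for that model.
        self._pending_by_model = defaultdict(lambda: defaultdict(int))

    def record_dispatch(self, model_id, replica_id):
        self._pending_by_model[model_id][replica_id] += 1

    def record_completion(self, model_id, replica_id):
        # The PR does this cleanup asynchronously, off the hot path;
        # it is inlined here for brevity.
        counts = self._pending_by_model[model_id]
        counts[replica_id] -= 1
        if counts[replica_id] <= 0:
            del counts[replica_id]

    def choose_replica(self, model_id):
        # Prefer the replica that already has the most pending requests
        # for this model, so its batch queue fills and requests for the
        # same model get batched together.
        counts = self._pending_by_model.get(model_id)
        if counts:
            return max(counts, key=counts.get)
        # Otherwise fall back to a default policy (e.g., least loaded).
        return self._replica_ids[0]
```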
Related issues
model_id in Model Multiplexing #50695
Additional information
Sample Code
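The PR's original sample snippet is not reproduced here. Below is a minimal usage sketch built from Ray Serve's public `@serve.multiplexed` and `@serve.batch` decorators; that they compose this way, with all requests in a batch sharing one `model_id`, is an assumption based on this PR's description.

```python
from ray import serve


@serve.deployment
class MultiModelDeployment:
    @serve.multiplexed(max_num_models_per_replica=3)
    async def get_model(self, model_id: str):
        # Hypothetical loader; replace with real model-loading logic.
        return lambda x: f"{model_id}: {x}"

    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.01)
    async def predict(self, inputs):
        # Assumed behavior with this PR: all requests in one batch
        # target the same multiplexed model_id.
        model = await self.get_model(serve.get_multiplexed_model_id())
        return [model(i) for i in inputs]

    async def __call__(self, request):
        return await self.predict(request)


app = MultiModelDeployment.bind()
# Clients select a model per request, e.g.:
#   handle.options(multiplexed_model_id="model_a").remote("input")
```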
Test files
test_multiplex_batching.py - Core functionality tests
test_multiplex_batching_router.py - Request routing tests
test_multiplex_batching_utils.py - Utilities and fixtures