
Conversation


@manickavela29 manickavela29 commented Nov 2, 2025

Description

This PR adds batching support to multiplexed model serving in Ray Serve, enabling multiple models to be hosted efficiently on a single replica with automatic request batching. This significantly improves resource utilization and reduces deployment costs in multi-model scenarios.

Performance Improvements

- Asynchronous cleanup: non-blocking cleanup of completed requests
- Batching-aware routing: prioritizes replicas with pending requests for the same model (sketched below)
- Efficient request tracking: optimized data structures for tracking pending requests by model
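As a rough illustration of the routing idea, here is a minimal sketch of batching-aware replica selection. The function and the `pending_models_by_replica` map are simplified stand-ins for illustration, not Ray Serve's internal router classes:

```python
from collections import defaultdict

def choose_replica(replicas, model_id, pending_models_by_replica):
    # Prefer a replica that already has pending requests for this model,
    # so the new request can join an in-flight batch.
    for replica in replicas:
        if model_id in pending_models_by_replica[replica]:
            return replica
    # Otherwise fall back to the replica with the fewest pending models.
    return min(replicas, key=lambda r: len(pending_models_by_replica[r]))

pending = defaultdict(set, {"replica-a": {"model-1"}, "replica-b": set()})
print(choose_replica(["replica-a", "replica-b"], "model-1", pending))  # replica-a
```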

Related issues

  1. [Serve] model multiplexing and batching does not work together #56633
  2. [Serve] group requests by model_id in Model Multiplexing #50695

Additional information

Sample Code

```python
from ray import serve

# ModelMultiplexer is the batching-aware multiplexer API introduced by this PR.

@serve.deployment
class MultiModelService:
    def __init__(self):
        # Initialize the multiplexer with batching enabled.
        self.multiplexer = ModelMultiplexer(
            model_load_func=self.load_model,
            max_num_models_per_replica=3,
            enable_batching=True,
            max_batch_size=8,
            batch_wait_timeout_s=0.01,
        )

    async def load_model(self, model_id: str):
        # Your model loading logic (load_your_model is a placeholder).
        return load_your_model(model_id)

    async def __call__(self, request):
        model_id = request.get("model_id")
        input_data = request.get("input")
        return await self.multiplexer.predict(input_data, model_id)
```
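For context, a hypothetical caller might look like the following, using Ray Serve's standard handle API (the payload shape is assumed from the sample above):

```python
from ray import serve

# Deploy the multiplexed service and obtain a handle to it.
app = MultiModelService.bind()
handle = serve.run(app)

# Requests for the same model_id that arrive within batch_wait_timeout_s
# can be grouped into a single batch on the replica.
response = handle.remote({"model_id": "model-1", "input": [1, 2, 3]})
print(response.result())
```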

Test files

- test_multiplex_batching.py - core functionality tests
- test_multiplex_batching_router.py - request routing tests
- test_multiplex_batching_utils.py - utilities and fixtures

Signed-off-by: manickavela29 <[email protected]>
@manickavela29 manickavela29 requested a review from a team as a code owner November 2, 2025 17:30
cursor[bot]: This comment was marked as outdated.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces batching capabilities to the multiplexing feature in Ray Serve, which is a valuable enhancement for performance. The changes include updates to the multiplexed decorator, the _ModelMultiplexWrapper, and the request router to support batching-aware routing. The implementation is mostly solid, but there are a few key issues to address. The batching-aware routing logic in the request router appears incomplete, as it identifies opportunities for batching but doesn't act on them. Additionally, there's a potential performance issue with how completed requests are cleaned up. The new tests are comprehensive, but one of them has flawed logic that doesn't correctly test the batching mechanism, and there are some leftover debugging statements that should be removed. Overall, this is a great feature addition, and with these fixes, it will be a strong contribution.

@ray-gardener ray-gardener bot added the serve (Ray Serve Related Issue), core (Issues that should be addressed in Ray Core), and community-contribution (Contributed by the community) labels Nov 2, 2025
@manickavela29 manickavela29 marked this pull request as draft November 3, 2025 01:54
@manickavela29 manickavela29 changed the title from multiplexing with batching to [WIP] multiplexing with batching Nov 3, 2025
@manickavela29 manickavela29 force-pushed the dev/mulitplex_batching branch from 5ea4fdc to 550ee37 November 3, 2025 04:02
Signed-off-by: manickavela29 <[email protected]>
@manickavela29 manickavela29 force-pushed the dev/mulitplex_batching branch 5 times, most recently from 0f61147 to f014923 November 3, 2025 11:48
@manickavela29 manickavela29 force-pushed the dev/mulitplex_batching branch from f014923 to 8cdf240 November 3, 2025 12:17
@manickavela29 manickavela29 marked this pull request as ready for review November 3, 2025 12:38
@manickavela29 manickavela29 changed the title from [WIP] multiplexing with batching to Ray Serve with multiplexing with batching Nov 3, 2025
@edoakes edoakes removed the core (Issues that should be addressed in Ray Core) label Nov 3, 2025
@manickavela29 manickavela29 changed the title from Ray Serve with multiplexing with batching to [Serve] Ray Serve with multiplexing with batching Nov 4, 2025
```python
        )
    except Exception:
        # Future might not have replica result, skip
        pass
```

Bug: Batching Optimization Fails due to Incorrect Future Check

The batching-aware routing logic is ineffective. The _get_pending_requests_for_model method filters out completed requests, but the subsequent batching-friendly replica selection logic incorrectly checks for pending_req.future.done(). This condition is always false, preventing the batching optimization from executing.
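A hedged reconstruction of the described check follows; PendingRequest and the candidate-selection helpers are hypothetical stand-ins for the PR's internals, shown only to make the inverted condition concrete:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class PendingRequest:
    model_id: str
    future: asyncio.Future

def batching_candidates(pending: list[PendingRequest], model_id: str):
    # Buggy: `pending` was already filtered to exclude completed requests,
    # so future.done() is always False and this list is always empty.
    return [r for r in pending if r.model_id == model_id and r.future.done()]

def batching_candidates_fixed(pending: list[PendingRequest], model_id: str):
    # Any still-pending request for the same model is a batching candidate.
    return [r for r in pending if r.model_id == model_id and not r.future.done()]
```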


```python
        request_context=request_context,
    )

    batch_queue.queue.put(single_request)
```

Bug: Misuse Breaks Lazy Batch Queue Wrapper API

The _LazyBatchQueueWrapper is used incorrectly by directly accessing its internal _queue attribute and its queue attribute. This bypasses the wrapper's lazy initialization and violates its intended API, which can lead to AttributeError or other runtime errors when handling batched requests or during shutdown.
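To illustrate why that matters, here is a generic sketch of the lazy-initialization pattern such a wrapper typically implements. This is not _LazyBatchQueueWrapper's actual interface, just an assumed shape:

```python
import asyncio
from typing import Optional

class LazyQueueWrapper:
    """Defers queue creation until first use, e.g. until an event loop exists."""

    def __init__(self) -> None:
        self._queue: Optional[asyncio.Queue] = None

    def get_queue(self) -> asyncio.Queue:
        # The only supported access path: initializes on first touch.
        if self._queue is None:
            self._queue = asyncio.Queue()
        return self._queue

wrapper = LazyQueueWrapper()
# Correct: goes through the accessor, which triggers lazy initialization.
wrapper.get_queue().put_nowait("request")
# Incorrect: reading the private attribute directly bypasses initialization;
# before first use, wrapper._queue is None, so a call like
# wrapper._queue.put_nowait(...) raises AttributeError.
```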

The same misuse appears in two additional locations.


@manickavela29 manickavela29 marked this pull request as draft November 5, 2025 14:29