Conversation

khuonglmhw (Collaborator) commented Dec 22, 2025

Purpose

Workflow

[workflow diagram image]

Test Plan

Potential issue: the transfer pool currently auto-evicts tensors when it is full, so a tensor could be dropped before it is transferred. I'm not sure whether this can happen in production; the transfer pool size is 1 GB and it has worked fine so far. A toy sketch of the hazard follows.
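
The snippet below is a self-contained toy model of the eviction hazard described above; it is NOT the actual TensorMemoryPool code, and the class name, address scheme, and eviction policy are assumptions for illustration only.

from collections import OrderedDict

# Toy auto-evicting pool; NOT the vLLM TensorMemoryPool, just an illustration
# of how a staged tensor could be dropped before its transfer completes.
class ToyTransferPool:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.used = 0
        self.next_addr = 0
        self.blocks: OrderedDict[int, int] = OrderedDict()  # addr -> size

    def allocate(self, size: int) -> int:
        # Auto-evict oldest blocks until the new request fits.
        while self.used + size > self.capacity and self.blocks:
            _, freed = self.blocks.popitem(last=False)  # oldest block dropped,
            self.used -= freed                          # even if not yet transferred
        if self.used + size > self.capacity:
            raise ValueError("request larger than pool capacity")
        addr = self.next_addr
        self.next_addr += size
        self.blocks[addr] = size
        self.used += size
        return addr

pool = ToyTransferPool(capacity=1024)
a = pool.allocate(768)  # tensor A staged; its transfer is still pending
b = pool.allocate(768)  # pool is full, so A is silently evicted first
assert a not in pool.blocks  # A's pending transfer can never complete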

Test Result

100 requests / 400 text tokens / 3 images (560,560,1)

Mooncake

============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  18.50     
Total input tokens:                      40000     
Total generated tokens:                  10000     
Request throughput (req/s):              5.41      
Output token throughput (tok/s):         540.57    
Peak output token throughput (tok/s):    1755.00   
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          2702.84   
---------------Time to First Token----------------
Mean TTFT (ms):                          11120.87  
Median TTFT (ms):                        12778.88  
P99 TTFT (ms):                           16057.00  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          33.15     
Median TPOT (ms):                        34.17     
P99 TPOT (ms):                           46.32     
---------------Inter-token Latency----------------
Mean ITL (ms):                           63.72     
Median ITL (ms):                         22.41     
P99 ITL (ms):                            1071.29   
==================================================

Disk

============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  21.53     
Total input tokens:                      40000     
Total generated tokens:                  10000     
Request throughput (req/s):              4.65      
Output token throughput (tok/s):         464.56    
Peak output token throughput (tok/s):    1645.00   
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          2322.80   
---------------Time to First Token----------------
Mean TTFT (ms):                          13298.42  
Median TTFT (ms):                        14679.84  
P99 TTFT (ms):                           19026.35  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          31.04     
Median TPOT (ms):                        31.07     
P99 TPOT (ms):                           43.69     
---------------Inter-token Latency----------------
Mean ITL (ms):                           62.96     
Median ITL (ms):                         23.50     
P99 ITL (ms):                            1121.29   
==================================================

Correctness:

================================================================================
Metrics:
{
    "explicit_prompt_relaxed_correctness": 0.8124,
    "anywhere_in_answer_relaxed_correctness": 0.8152
}
================================================================================

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Copilot AI left a comment

Pull request overview

This PR adds Mooncake engine support for disaggregated encoder cache (EC) transfer in encoder-prefill-decode (EPD) architectures. The implementation enables remote encoder cache transfer between encoder workers and prefill/decode workers using the Mooncake transfer engine with RDMA support, improving performance in disaggregated serving scenarios.

Key Changes:

  • Implemented MooncakeECConnector with scheduler and worker components for EC transfer over RDMA
  • Extended TensorMemoryPool to support both CPU (pinned) and CUDA device memory with auto-eviction capability
  • Added ec_transfer_params propagation throughout the request/response pipeline (similar to kv_transfer_params)
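
As a rough illustration of the last point, the EC metadata returned to the proxy might look like the kv_transfer_params analogue below; every key name here is an assumption for illustration, not the MooncakeECConnector's actual schema.

# Hypothetical shape of ec_transfer_params, by analogy with kv_transfer_params;
# all key names are illustrative assumptions, not the connector's real schema.
ec_transfer_params = {
    "remote_engine_id": "encoder-0",  # assumed: encoder instance holding the cache
    "remote_host": "192.168.0.10",    # assumed: Mooncake peer address for RDMA reads
    "remote_port": 8998,              # assumed: bootstrap/handshake port
    "mm_hashes": ["<mm-item-hash>"],  # assumed: identifies the cached MM embeddings
}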

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 11 comments.

Summary per file:

vllm/distributed/ec_transfer/ec_connector/mooncake_connector.py: New Mooncake connector implementation for EC transfer with scheduler/worker split and ZMQ-based coordination
vllm/distributed/kv_transfer/kv_connector/v1/p2p/tensor_memory_pool.py: Enhanced memory pool to support CUDA device memory and auto-eviction for flexible buffer management
vllm/v1/worker/gpu_model_runner.py: Integrated transfer pool initialization and wait_for_ec_load call before MM embedding gathering
vllm/v1/worker/ec_connector_model_runner_mixin.py: Added wait_for_load method to synchronize EC transfers before use
vllm/v1/request.py: Added ec_transfer_params field to Request for propagating EC transfer metadata
vllm/v1/engine/__init__.py: Added ec_transfer_params to EngineCoreOutput for returning EC metadata
vllm/v1/engine/output_processor.py: Plumbed ec_transfer_params through the output processing pipeline
vllm/v1/core/sched/scheduler.py: Added _ec_connector_finished method and ec_transfer_params handling in the request lifecycle
vllm/outputs.py: Added ec_transfer_params field to RequestOutput
vllm/envs.py: Added VLLM_EC_MOONCAKE_BOOTSTRAP_PORT configuration for the Mooncake handshake
vllm/entrypoints/openai/serving_completion.py: Propagated ec_transfer_params in completion response generation
vllm/entrypoints/openai/serving_chat.py: Propagated ec_transfer_params in chat completion responses
vllm/entrypoints/openai/protocol.py: Added ec_transfer_params fields to request/response protocol models
vllm/distributed/ec_transfer/ec_connector/factory.py: Registered MooncakeECConnector in the connector factory
vllm/distributed/ec_transfer/ec_connector/example_connector.py: Added wait_for_load stub to the example connector for interface consistency
vllm/distributed/ec_transfer/ec_connector/base.py: Renamed register_caches to register_encoder_cache and added a wait_for_load abstract method (see the interface sketch below)
examples/online_serving/disaggregated_encoder/mooncake_connector/disagg_1e1pd_example.sh: Shell script for a 1 encoder + 1 prefill-decode worker setup with Mooncake
examples/online_serving/disaggregated_encoder/mooncake_connector/disagg_1e1p1d_example.sh: Shell script for a 1 encoder + 1 prefill + 1 decode worker setup with Mooncake
examples/online_serving/disaggregated_encoder/disagg_epd_proxy.py: Enhanced proxy to aggregate ec_transfer_params from encoder responses and forward them to prefill

        None if decoder-only.
    num_cached_tokens: The number of tokens with prefix cache hit.
    kv_transfer_params: The params for remote K/V transfer.
    ec_tranfer_params: The params for remote EC transfer.
Copilot AI Jan 2, 2026

Typo in docstring: "ec_tranfer_params" should be "ec_transfer_params" (missing 's').

Suggested change
ec_tranfer_params: The params for remote EC transfer.
ec_transfer_params: The params for remote EC transfer.

Comment on lines +447 to +453
if not self.is_producer:
    self._sender_executor.shutdown(wait=False)
    if self._mooncake_sender_t:
        self._mooncake_sender_t.join()
elif self.receiver_loop.is_running():
    self.receiver_loop.call_soon_threadsafe(self.receiver_loop.stop)
    self._mooncake_receiver_t.join()
Copilot AI Jan 2, 2026

The shutdown logic appears to have inverted conditions. When is_producer is True, a producer should shutdown _sender_executor (which is created for producers). When is_producer is False (consumer), it should shutdown the receiver loop. Currently, the conditions are reversed.
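
One possible shape of the fix, as a sketch under the assumption that producers own the sender executor and consumers own the receiver loop:

if self.is_producer:
    # Producers created the sender executor/thread, so they tear those down.
    self._sender_executor.shutdown(wait=False)
    if self._mooncake_sender_t:
        self._mooncake_sender_t.join()
elif self.receiver_loop.is_running():
    # Consumers stop the receiver event loop and join its thread.
    self.receiver_loop.call_soon_threadsafe(self.receiver_loop.stop)
    self._mooncake_receiver_t.join()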

    Raises:
        ValueError: If address is invalid or not allocated
    """
    if not addr:
Copilot AI Jan 2, 2026

The condition if not addr: will be True when addr is 0, which is a valid memory address. This should be if addr is None: to properly check for None.

Suggested change
if not addr:
if addr is None:

Comment on lines +157 to 165
while True:
    try:
        return self._allocate(required_size)
    except ValueError:
        if self.auto_evict:
            self.free()
        else:
            raise

Copilot AI Jan 2, 2026

The auto_evict loop could run indefinitely if the pool cannot satisfy the allocation even after freeing the oldest block. There should be a check to prevent infinite looping, such as tracking whether any block was freed or limiting the number of eviction attempts.

Suggested change
while True:
    try:
        return self._allocate(required_size)
    except ValueError:
        if self.auto_evict:
            self.free()
        else:
            raise

# Bound the number of eviction attempts to avoid potential infinite loops.
# We cannot free more blocks than currently allocated.
max_evictions = len(self.allocated_blocks)
evictions = 0
while True:
    try:
        return self._allocate(required_size)
    except ValueError as e:
        if not self.auto_evict:
            # Auto-eviction disabled: propagate the allocation failure.
            raise
        if evictions >= max_evictions:
            # All currently allocated blocks have already been considered
            # for eviction; further attempts are unlikely to succeed.
            raise e
        prev_len = len(self.allocated_blocks)
        try:
            # Free the oldest allocated block.
            self.free()
        except ValueError:
            # No block could be freed; propagate the original allocation error.
            raise e
        # Ensure that eviction made progress; if not, avoid looping endlessly.
        if len(self.allocated_blocks) >= prev_len:
            raise e
        evictions += 1

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
