@khuonglmhw
Collaborator

@khuonglmhw khuonglmhw commented Dec 11, 2025

Purpose

Implement Mooncake storage EC connector

Test Plan

Test Result

Results of running disagg_1e1pd_example.sh on an A100-40GB:

rdma:

============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  18.40     
Total input tokens:                      15000     
Total generated tokens:                  10000     
Request throughput (req/s):              5.43      
Output token throughput (tok/s):         543.41    
Peak output token throughput (tok/s):    681.00    
Peak concurrent requests:                100.00    
Total Token throughput (tok/s):          1358.53   
---------------Time to First Token----------------
Mean TTFT (ms):                          8403.98   
Median TTFT (ms):                        8555.08   
P99 TTFT (ms):                           16054.71  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          25.17     
Median TPOT (ms):                        25.08     
P99 TPOT (ms):                           29.13     
---------------Inter-token Latency----------------
Mean ITL (ms):                           26.79     
Median ITL (ms):                         22.37     
P99 ITL (ms):                            115.44    
==================================================

tcp:

============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  19.22     
Total input tokens:                      15000     
Total generated tokens:                  10000     
Request throughput (req/s):              5.20      
Output token throughput (tok/s):         520.41    
Peak output token throughput (tok/s):    722.00    
Peak concurrent requests:                100.00    
Total Token throughput (tok/s):          1301.03   
---------------Time to First Token----------------
Mean TTFT (ms):                          9057.54   
Median TTFT (ms):                        8348.85   
P99 TTFT (ms):                           15949.27  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          24.83     
Median TPOT (ms):                        24.95     
P99 TPOT (ms):                           26.65     
---------------Inter-token Latency----------------
Mean ITL (ms):                           26.29     
Median ITL (ms):                         22.53     
P99 ITL (ms):                            92.85     
==================================================

example connector (disk):

============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  18.57     
Total input tokens:                      15000     
Total generated tokens:                  10000     
Request throughput (req/s):              5.38      
Output token throughput (tok/s):         538.45    
Peak output token throughput (tok/s):    736.00    
Peak concurrent requests:                100.00    
Total Token throughput (tok/s):          1346.12   
---------------Time to First Token----------------
Mean TTFT (ms):                          8258.42   
Median TTFT (ms):                        8068.73   
P99 TTFT (ms):                           15228.11  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          23.78     
Median TPOT (ms):                        23.86     
P99 TPOT (ms):                           25.54     
---------------Inter-token Latency----------------
Mean ITL (ms):                           24.93     
Median ITL (ms):                         21.56     
P99 ITL (ms):                            92.76     
==================================================
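As a rough consistency check on the three runs above, the throughput rows can be re-derived from the request and token counts and the benchmark duration. The sketch below assumes throughput is simply count divided by duration, which is an assumption about the benchmark script rather than something stated here; small gaps are expected because the printed duration is rounded to two decimals.

```python
# Counts are identical across the three runs reported above.
REQUESTS = 100
TOTAL_TOKENS = 15000 + 10000  # input tokens + generated tokens

# name -> (duration in s, reported req/s, reported total tok/s), copied from the tables.
reported = {
    "rdma": (18.40, 5.43, 1358.53),
    "tcp": (19.22, 5.20, 1301.03),
    "disk": (18.57, 5.38, 1346.12),
}


def derive(duration_s):
    """Assumed definition: throughput = count / benchmark duration."""
    return REQUESTS / duration_s, TOTAL_TOKENS / duration_s


for name, (duration_s, req_s, tok_s) in reported.items():
    d_req, d_tok = derive(duration_s)
    print(f"{name}: req/s {d_req:.2f} (reported {req_s}), "
          f"tok/s {d_tok:.2f} (reported {tok_s})")
```

Under that assumption the derived and reported values agree closely, so the differences between rdma, tcp, and the disk example connector in these runs come down to the roughly one-second spread in total duration.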

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@khuonglmhw khuonglmhw changed the title from "initial" to "[Feat] EPD Mooncake Store Connector" on Dec 11, 2025
Collaborator

@knlnguyen1802 knlnguyen1802 left a comment

LGTM now.

@knlnguyen1802
Collaborator

cc @fake0fan PTAL again

Copilot AI left a comment

Pull request overview

This PR implements a Mooncake storage connector for the Encoder Cache (EC) disaggregated architecture, enabling distributed encoder cache transfer across vLLM instances. The connector supports both regular and zero-copy transfer modes for efficient multimodal data sharing.

Key Changes:

  • Introduces ECMooncakeStorageConnector with support for async batch operations and zero-copy transfers using pinned memory
  • Adds TensorMemoryPool with buddy allocation for efficient pinned memory management with FIFO eviction
  • Integrates a synchronization mechanism (wait_for_save()) to ensure the encoder cache is persisted before a request completes
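
To make the buddy-allocation and FIFO-eviction idea in the list above concrete, here is a minimal sketch over plain integer offsets. The class name BuddyPool, its methods, and the block sizes are hypothetical and chosen for illustration; this is not the PR's TensorMemoryPool, and it does not touch pinned memory or tensors.

```python
from collections import OrderedDict


class BuddyPool:
    """Toy buddy allocator over integer offsets (hypothetical illustration)."""

    def __init__(self, total_size, min_block=64):
        # Power-of-two sizes make a block's buddy simply `offset ^ size`.
        assert total_size & (total_size - 1) == 0
        assert min_block & (min_block - 1) == 0
        self.total_size = total_size
        self.min_block = min_block
        # free_lists[size] holds the offsets of free blocks of that size.
        self.free_lists = {}
        size = min_block
        while size <= total_size:
            self.free_lists[size] = set()
            size *= 2
        self.free_lists[total_size].add(0)
        # offset -> block size; insertion order doubles as FIFO eviction order.
        self.allocated = OrderedDict()

    def _round_up(self, nbytes):
        size = self.min_block
        while size < nbytes:
            size *= 2
        return size

    def _try_allocate(self, size):
        # Find the smallest free block that fits the request.
        cur = size
        while cur <= self.total_size and not self.free_lists[cur]:
            cur *= 2
        if cur > self.total_size:
            return None
        offset = self.free_lists[cur].pop()
        # Split down to the requested size, freeing each upper half.
        while cur > size:
            cur //= 2
            self.free_lists[cur].add(offset + cur)
        self.allocated[offset] = size
        return offset

    def allocate(self, nbytes):
        size = self._round_up(nbytes)
        if size > self.total_size:
            raise ValueError("request is larger than the pool")
        while True:
            offset = self._try_allocate(size)
            if offset is not None:
                return offset
            if not self.allocated:
                raise ValueError("allocation failed")
            # FIFO eviction: release the oldest live block and retry.
            self.free(next(iter(self.allocated)))

    def free(self, offset):
        size = self.allocated.pop(offset)
        # Coalesce with the buddy for as long as the buddy is also free.
        while size < self.total_size:
            buddy = offset ^ size
            if buddy not in self.free_lists[size]:
                break
            self.free_lists[size].remove(buddy)
            offset = min(offset, buddy)
            size *= 2
        self.free_lists[size].add(offset)


pool = BuddyPool(total_size=256, min_block=64)
a = pool.allocate(64)
b = pool.allocate(64)
c = pool.allocate(128)
d = pool.allocate(64)  # pool is full, so the oldest block (a) is evicted
```

In a real connector, silently evicting a block that a pending transfer still references would be unsafe, so an actual pool would need to track which blocks are in flight; that bookkeeping is omitted here.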

Reviewed changes

Copilot reviewed 10 out of 12 changed files in this pull request and generated 20 comments.

Show a summary per file
| File | Description |
| --- | --- |
| vllm/distributed/ec_transfer/ec_lookup_buffer/mooncake_store.py | Core Mooncake store implementation with batch operations and zero-copy support |
| vllm/distributed/ec_transfer/ec_connector/mooncake_storage_connector.py | Connector interface implementation for EC transfer operations |
| vllm/distributed/ec_transfer/utils/tensor_memory_pool.py | Buddy allocator-based memory pool for pinned host memory management |
| vllm/distributed/ec_transfer/ec_connector/base.py | Added wait_for_save() interface method to base connector |
| vllm/v1/worker/ec_connector_model_runner_mixin.py | Added mixin method to wait for async save operations |
| vllm/v1/worker/gpu_model_runner.py | Integrated wait for save after multimodal encoding |
| vllm/distributed/ec_transfer/ec_connector/factory.py | Registered new Mooncake connector in factory |
| tests/v1/ec_connector/unit/test_mooncake_store.py | Comprehensive unit tests with fake Mooncake store implementation |
| examples/.../mooncake_connector/disagg_1e1pd_example.sh | Example script for 1 encoder + 1 prefill/decode setup |
| examples/.../mooncake_connector/disagg_1e1p1d_example.sh | Example script for 1 encoder + 1 prefill + 1 decode setup |

Signed-off-by: Khuong Le <[email protected]>
Raises:
ValueError: If tensor is not on CUDA or allocation fails
"""
if not tensor.is_cuda:
@Shirley125 Shirley125 Dec 17, 2025

Using is_cuda is incompatible with NPU. Also, this file is similar to distributed/kv_transfer/kv_connector/v1/p2p/tensor_memory_pool.py. Is it possible to avoid adding a separate file?
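
One possible shape for a device-agnostic check, as a sketch only: compare the tensor's device type against a set of accepted accelerator types instead of calling is_cuda. The helper name ensure_on_accelerator and the allowed set are hypothetical, and "npu" as the device type string for NPU backends is an assumption here. The stand-in objects below only mimic the `tensor.device.type` attribute so the example runs without torch.

```python
from types import SimpleNamespace

# Assumption: device type strings accepted by the pool; "npu" is assumed for NPU backends.
ACCELERATOR_DEVICE_TYPES = frozenset({"cuda", "npu"})


def ensure_on_accelerator(tensor, allowed=ACCELERATOR_DEVICE_TYPES):
    """Raise ValueError unless tensor.device.type is an accepted accelerator type."""
    device_type = tensor.device.type
    if device_type not in allowed:
        raise ValueError(
            f"expected tensor on one of {sorted(allowed)}, got {device_type!r}"
        )
    return device_type


# Stand-ins that expose only `.device.type`, the attribute the check reads.
npu_tensor = SimpleNamespace(device=SimpleNamespace(type="npu"))
cpu_tensor = SimpleNamespace(device=SimpleNamespace(type="cpu"))

print(ensure_on_accelerator(npu_tensor))
try:
    ensure_on_accelerator(cpu_tensor)
except ValueError as exc:
    print(f"rejected: {exc}")
```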
