[Feature] Add SiMM as sglang HiCache Storage backend #18016
hhu-scitix wants to merge 14 commits into sgl-project:main
Conversation
Summary of Changes

Hello @hhu-scitix, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly expands SGLang's capabilities by integrating SiMM as a distributed, high-performance KV cache backend, which is crucial for scaling AI workloads. It also introduces support for the ExaoneMoE model, enhances the NPU attention backend for better cross-attention handling, and refines multimodal processing and speculative decoding with new safety checks and grammar integration. These changes collectively aim to improve performance, scalability, and model compatibility within the SGLang framework.
Code Review
This pull request introduces SiMM as a new storage backend for HiCache, a significant feature that enhances distributed caching capabilities. It also adds support for the Exaone-MoE model family, including its multi-token prediction variant. Additionally, the changes include several valuable improvements and bug fixes, such as correcting padding logic for mrope_positions and safely initializing bias tensors with zeros. The overall code quality is high, and the new features are well-integrated.
```diff
 if self.mrope_positions is not None:
-    self.mrope_positions = self._pad_tensor_to_size(self.mrope_positions, bs)
+    self.mrope_positions = torch.cat(
+        [
+            self.mrope_positions,
+            self.mrope_positions.new_zeros(
+                3, num_tokens - self.mrope_positions.shape[1]
+            ),
+        ],
+        dim=1,
+    )
```
```diff
 self.bias = Parameter(
-    torch.empty(self.output_size_per_partition, dtype=params_dtype)
+    torch.zeros(self.output_size_per_partition, dtype=params_dtype)
 )
```
```diff
 MOONCAKE_CHECK_SERVER = EnvBool(False)
 MOONCAKE_STANDALONE_STORAGE = EnvBool(False)

+# SiMM
```
Can we move all the environment variables into `--hicache-storage-backend-extra-config`, as has been done for the other backends?
We already support `--hicache-storage-backend-extra-config` and have removed all the environment variables.
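For reference, a minimal sketch of how a backend can consume that flag's JSON payload (the `cluster_manager` key is illustrative, not SiMM's actual schema):

```python
import json
from typing import Any, Dict, Optional


def parse_extra_config(extra_config: Optional[str]) -> Dict[str, Any]:
    """Parse the JSON string passed via --hicache-storage-backend-extra-config."""
    if not extra_config:
        return {}
    return json.loads(extra_config)


# e.g. parse_extra_config('{"cluster_manager": "10.0.0.1:9000"}')
# -> {"cluster_manager": "10.0.0.1:9000"}
```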
Clean PR, thanks :)
```python
except ImportError as e:
    raise ImportError(
        "Please install simm by following the instructions at "
        "to run vLLM with SimmConnector."
```
- Maybe you need to change `vLLM` into `SGLang` in the log info.
- I don't see a link after `following the instructions` in the log info; could you add it, please?
```python
@staticmethod
def from_file() -> "SiMMConfig":
    """Load the config from a JSON file."""
    if os.environ.get(SGLANG_HICACHE_SIMM_CONFIG_PATH_ENV_VAR) is None:
```
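For readers following along, a self-contained sketch of what the full loader presumably looks like; everything past the quoted lines (the env var value, the fallback, the field set) is an assumption:

```python
import json
import os
from dataclasses import dataclass

# Hypothetical value; the real constant lives in the SiMM backend module.
SGLANG_HICACHE_SIMM_CONFIG_PATH_ENV_VAR = "SGLANG_HICACHE_SIMM_CONFIG_PATH"


@dataclass
class SiMMConfig:
    # Illustrative field; SiMM's real schema is defined in this PR.
    cluster_manager: str = ""

    @staticmethod
    def from_file() -> "SiMMConfig":
        """Load the config from a JSON file."""
        path = os.environ.get(SGLANG_HICACHE_SIMM_CONFIG_PATH_ENV_VAR)
        if path is None:
            # No config file given: fall back to defaults.
            return SiMMConfig()
        with open(path) as f:
            return SiMMConfig(**json.load(f))
```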
```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, List, Optional
```
Please wrap these in `TYPE_CHECKING` to reduce the time spent importing sglang at runtime.
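A sketch of the suggested pattern, assuming `datetime` is only referenced in annotations while `defaultdict` and `dataclass` are actually executed at runtime:

```python
from __future__ import annotations  # annotations stay unevaluated at runtime

from collections import defaultdict  # runtime imports: actually executed
from dataclasses import dataclass
from typing import TYPE_CHECKING, Any, Dict, List, Optional

if TYPE_CHECKING:
    # Imported only while type-checking; adds no runtime import cost.
    from datetime import datetime
```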
```python
keys = []
for _ in range(kv_num):
    key = "test_" + str(uuid.uuid4())
    keys.append(key)
```
Please use a list comprehension instead of the for loop.
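i.e., something like:

```python
keys = [f"test_{uuid.uuid4()}" for _ in range(kv_num)]
```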
```python
try:
    self.mr_ext = register_mr(buffer)
    if self.mr_ext is None:
        logger.error(f"Failed to register buffer")
```
Could you please add more info here to tell developers how to debug or resolve this error?
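For example, a more actionable message might look like this (a sketch; the suggested checks and the size reporting are assumptions about a typical RDMA registration failure):

```python
if self.mr_ext is None:
    # len(buffer) assumes `buffer` supports len(); adapt to the real type.
    logger.error(
        "Failed to register memory region for buffer (size=%d). "
        "Check that the RDMA NIC is visible, that `ulimit -l` permits "
        "pinning this much memory, and that the SiMM data server is "
        "reachable.",
        len(buffer),
    )
```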
```python
for key_ in keys:
    key_list.append(f"{key_}_{self.mha_suffix}_k")
    key_list.append(f"{key_}_{self.mha_suffix}_v")
assert len(key_list) == len(ptr_list)
```
Hope you can add a docstring or log message to these assert statements.
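e.g., a sketch attaching a diagnostic message to the quoted assert:

```python
assert len(key_list) == len(ptr_list), (
    f"key/ptr count mismatch: {len(key_list)} keys vs {len(ptr_list)} ptrs; "
    "each base key should expand to exactly one _k and one _v entry"
)
```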
Before launching the `SGLang server` with SiMM, you should first launch the SiMM `cluster manager service` and `data server service`.

You can visit the [SiMM official deploy guide]() and deploy SiMM on your K8s cluster with an RDMA network.
The link for the SiMM official deploy guide is missing.
We will open-source SiMM by the end of February and update the documentation links in this PR accordingly.
| "aibrix", | ||
| "dynamic", | ||
| "eic", | ||
| "simm", |
How to install SiMM for testing?
You can use `docker/simm.Dockerfile` to build an SGLang image with the SiMM package; the SiMM backend binary can be fetched from https://oss-ap-southeast.scitix.ai/hisys-simm-depository/simm/v0.2.0/.
The image is based on Ubuntu 24.04, and you will be able to build from source after we open-source SiMM.
Description
SiMM (Scitix In-Memory Middleware) is a distributed, high-performance, elastic cache acceleration layer for all AI workloads.
Feature
Benchmarking and Profiling
Test DeepSeek R1 on 8x H200 141G with an 8x400Gbps RoCE network (MT4129).
GPU driver: 570.86.15; CUDA version: 12.9.
Use `benchmark/hicache/bench_multiturn.py`.
client:
server:
'GPU' means only the GPU cache is used (HiCache disabled).
Checklist