
[Feature] Add SiMM as sglang HiCache Storage backend #18016

Open

hhu-scitix wants to merge 14 commits into sgl-project:main from scitix:feature/simm
Conversation

@hhu-scitix commented Jan 31, 2026

Description

SiMM (Scitix In-Memory Middleware) is a distributed, high-performance, elastic cache-acceleration layer for AI workloads.

Feature

  1. Support the 'page_first' and 'page_first_direct' zero-copy cache layouts.
  2. Support NUMA-aware RDMA NIC selection.

Benchmarking and Profiling

Tested DeepSeek R1 on 8x H200 141GB GPUs with an 8x400Gbps RoCE network (MT4129).

GPU Driver: 570.86.15, CUDA version: 12.9

Using benchmark/hicache/bench_multiturn.py.

client:

python3 bench_multiturn.py \
  --num-clients 128 --max-parallel * \
  --request-length 8000 --output-length 200 \
  --host 127.0.0.1 --port 8080 --disable-auto-run \
  --disable-random-sample \
  --model-path /models/preset/deepseek-ai/DeepSeek-R1/v1.0 \
  --num-rounds=* --seed=* \
  --ready-queue-policy=fifo \
  --sub-question-input-length=128

server:

python3 -m sglang.launch_server \
  --model /models/preset/deepseek-ai/DeepSeek-R1/v1.0 \
  --trust-remote-code \
  --tp 8 --mem-fraction-static 0.75 \
  --page-size 64 \
  --enable-hierarchical-cache \
  --hicache-ratio 1.1 \
  --hicache-mem-layout page_first_direct \
  --hicache-io-backend direct \
  --hicache-storage-backend simm \
  --hicache-write-policy write_through \
  --hicache-storage-prefetch-policy timeout \
  --hicache-storage-backend-extra-config '{"manager_address":"0.0.0.0:30001"}' \
  --port 8080

| rounds | parallel | Req throughput (req/s), SiMM | Req throughput (req/s), GPU | Input throughput (token/s), SiMM | Input throughput (token/s), GPU | Output throughput (token/s), SiMM | Output throughput (token/s), GPU | SLO TTFT (ms), SiMM | SLO E2E latency (ms), SiMM | SLO TTFT (ms), GPU | SLO E2E latency (ms), GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 4 | 0.81 | 0.80 | 6856.36 | 6810.02 | 161.14 | 159.83 | 0.43 | 3.20 | 0.50 | 3.45 |
| 3 | 8 | 0.97 | 0.90 | 8250.51 | 7670.94 | 193.62 | 180.09 | 0.47 | 3.84 | 0.56 | 4.05 |
| 3 | 16 | 0.97 | 1.02 | 8236.79 | 8653.88 | 193.49 | 203.30 | 0.49 | 3.73 | 0.63 | 4.92 |
| 10 | 4 | 0.88 | 0.80 | 8153.91 | 7767.07 | 168.62 | 160.61 | 0.41 | 3.11 | 0.56 | 3.57 |
| 10 | 8 | 0.97 | 0.91 | 9353.14 | 8819.79 | 193.55 | 182.40 | 0.43 | 3.65 | 0.66 | 4.51 |
| 10 | 16 | 1.01 | 0.98 | 9733.29 | 9459.86 | 201.32 | 195.47 | 0.50 | 4.08 | 0.72 | 5.23 |

'GPU' means only the GPU cache is used (HiCache disabled).

Checklist

hhu-scitix and others added 9 commits January 30, 2026 15:24
@github-actions bot added the Multi-modal, hicache, npu, and diffusion labels on Jan 31, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @hhu-scitix, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands SGLang's capabilities by integrating SiMM as a distributed, high-performance KV cache backend, which is crucial for scaling AI workloads. It also introduces support for the ExaoneMoE model, enhances the NPU attention backend for better cross-attention handling, and refines multimodal processing and speculative decoding with new safety checks and grammar integration. These changes collectively aim to improve performance, scalability, and model compatibility within the SGLang framework.

Highlights

  • SiMM HiCache Storage Backend Integration: Introduced SiMM (Scitix In-Memory Middleware) as a new, high-performance, distributed L3 KV cache storage backend for SGLang's HiCache. This includes support for 'page_first' and 'page_first_direct' memory layouts for zero-copy cache, and NUMA-aware RDMA NIC selection for optimized data transfer. Configuration can be done via extra arguments, JSON files, or environment variables.
  • NPU/Ascend Attention Backend Enhancements: Improved the NPU (Ascend) attention backend by introducing AscendTorchNativeAttnBackend to handle native SDPA operations. This update specifically addresses accuracy issues in cross-attention scenarios and ensures proper KV cache handling for both extend and decode phases, especially when encoder_lens are present.
  • LoRA Scale Parameter Addition: Added a lora_scale parameter to ServerArgs and its corresponding CLI argument. This allows users to specify a scaling factor for merging LoRA adapters, similar to the lora_scale in Diffusers, enhancing flexibility for LoRA usage.
  • ExaoneMoE Model Support: Added comprehensive support for the ExaoneMoEForCausalLM model, including its MLP, Attention, and DecoderLayer components. This also includes a specialized ExaoneMoEForCausalLMMTP for speculative decoding, and updates to model configuration and server arguments to properly handle this architecture.
  • Multimodal Padding Safety and Distributed Attention Refinements: Implemented a sanity check for multimodal pad values to prevent overlap with valid token IDs, ensuring robustness in multimodal models. Additionally, distributed attention mechanisms in Qwen3-VL and general multimodal utilities were refined to consistently use get_attention_tp_size/rank and get_attention_tp_group().all_gather for improved data parallelism.
  • Grammar-Aware Speculative Decoding: Enhanced speculative decoding by integrating grammar objects into the Ngram verification process. This allows for the generation of token bitmasks based on grammar rules, enabling more accurate and constrained output generation during speculative decoding.



@gemini-code-assist bot left a comment


Code Review

This pull request introduces SiMM as a new storage backend for HiCache, a significant feature that enhances distributed caching capabilities. It also adds support for the Exaone-MoE model family, including its multi-token prediction variant. Additionally, the changes include several valuable improvements and bug fixes, such as correcting padding logic for mrope_positions and safely initializing bias tensors with zeros. The overall code quality is high, and the new features are well-integrated.

Comment on lines 862 to 871
    if self.mrope_positions is not None:
        self.mrope_positions = self._pad_tensor_to_size(self.mrope_positions, bs)
        self.mrope_positions = torch.cat(
            [
                self.mrope_positions,
                self.mrope_positions.new_zeros(
                    3, num_tokens - self.mrope_positions.shape[1]
                ),
            ],
            dim=1,
        )
Contributor

high

The previous padding logic for mrope_positions appeared to be incorrect as it was padding along the wrong dimension with the batch size. This change correctly pads the sequence length dimension to num_tokens, which fixes the issue. Good catch!

Comment on lines 356 to 358
     self.bias = Parameter(
-        torch.empty(self.output_size_per_partition, dtype=params_dtype)
+        torch.zeros(self.output_size_per_partition, dtype=params_dtype)
     )
Contributor

medium

Initializing the bias tensor with torch.zeros instead of torch.empty is a great improvement for robustness. This prevents potential non-deterministic behavior from uninitialized memory if the bias is not loaded from a checkpoint.

@xiezhq-hermann xiezhq-hermann self-assigned this Jan 31, 2026
MOONCAKE_CHECK_SERVER = EnvBool(False)
MOONCAKE_STANDALONE_STORAGE = EnvBool(False)

# SiMM
Collaborator

Can we move all the environment variables into --hicache-storage-backend-extra-config, as has been done for the other backends?

Author

We already support --hicache-storage-backend-extra-config and have removed all the environment variables.

@xiezhq-hermann
Collaborator

Clean PR thanks : )

    except ImportError as e:
        raise ImportError(
            "Please install simm by following the instructions at "
            "to run vLLM with SimmConnector."
Collaborator

  1. Maybe you need to change vLLM to SGLang in the log message.
  2. I don't see a link after "following the instructions at" in the log message; could you add it please? (A possible rewording is sketched below.)
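
A possible rewording along those lines; the install-guide URL is a placeholder until SiMM is open-sourced, and the package name `simm` is assumed from the quoted snippet:

```python
try:
    import simm  # assumed package name, as in the quoted snippet
except ImportError as e:
    raise ImportError(
        "Please install SiMM by following the instructions at "
        "<SiMM install guide URL> to run SGLang with the SiMM HiCache backend."
    ) from e
```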

    @staticmethod
    def from_file() -> "SiMMConfig":
        """Load the config from a JSON file."""
        if os.environ.get(SGLANG_HICACHE_SIMM_CONFIG_PATH_ENV_VAR) is None:
Collaborator

Could you please move all the envs in here? (One possible shape is sketched below.)
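
A minimal sketch of that direction; the field name, env-var name, and precedence order below are illustrative, not the actual SiMMConfig API:

```python
import json
import os
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class SiMMConfigSketch:
    manager_address: str = "0.0.0.0:30001"

    @classmethod
    def load(cls, extra_config: Optional[Dict[str, Any]] = None) -> "SiMMConfigSketch":
        # Illustrative resolution order:
        # --hicache-storage-backend-extra-config > JSON config file > defaults.
        if extra_config:
            return cls(**extra_config)
        config_path = os.environ.get("SGLANG_HICACHE_SIMM_CONFIG_PATH")  # assumed env var name
        if config_path:
            with open(config_path) as f:
                return cls(**json.load(f))
        return cls()
```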

from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, List, Optional
Collaborator

Please wrap it in TYPE_CHECKING to reduce the time spent importing sglang at runtime (see the sketch below).
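
A minimal sketch of the TYPE_CHECKING pattern being suggested; the imported name and function signature are illustrative:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Resolved only by type checkers; skipped at runtime, so importing this
    # module does not pull in the rest of sglang.
    from sglang.srt.server_args import ServerArgs


def init_backend(server_args: "ServerArgs") -> None:
    # The string annotation keeps the type hint without a runtime import.
    ...
```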

Comment on lines 16 to 19
    keys = []
    for _ in range(kv_num):
        key = "test_" + str(uuid.uuid4())
        keys.append(key)
Collaborator

Please use a list comprehension instead of the for-loop (see below).
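
For reference, the quoted loop as a list comprehension (the kv_num value is illustrative):

```python
import uuid

kv_num = 8  # illustrative value
keys = [f"test_{uuid.uuid4()}" for _ in range(kv_num)]
```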

    try:
        self.mr_ext = register_mr(buffer)
        if self.mr_ext is None:
            logger.error(f"Failed to register buffer")
Collaborator

Could you please add more information so developers know how to debug or resolve this error? One possible shape is sketched below.
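
A sketch only, mirroring the quoted fragment and assuming RDMA memory registration is what can fail here; the specific hints are illustrative:

```python
try:
    self.mr_ext = register_mr(buffer)
    if self.mr_ext is None:
        logger.error(
            "Failed to register RDMA memory region for the host buffer. "
            "Check that an RDMA-capable NIC is visible, that locked-memory "
            "limits (ulimit -l) allow pinning the buffer, and that the SiMM "
            "data server is running and reachable."
        )
```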

    for key_ in keys:
        key_list.append(f"{key_}_{self.mha_suffix}_k")
        key_list.append(f"{key_}_{self.mha_suffix}_v")
    assert len(key_list) == len(ptr_list)
Collaborator

I hope you can add messages or log info to these assert statements; an example is sketched below.
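
For example, assuming the k/v pairing shown in the quoted loop, the assert could carry the mismatch details:

```python
assert len(key_list) == len(ptr_list), (
    f"key/ptr count mismatch: {len(key_list)} keys vs {len(ptr_list)} pointers "
    f"(expected 2 keys per page: one '_k' and one '_v')"
)
```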


Before launching the `SGLang server` with SiMM, you should launch the SiMM `cluster manager service` and `data server service`.

You can visit the [SiMM official deploy guide]() and deploy SiMM on your K8s cluster with an RDMA network.
Collaborator

The link to the SiMM official deploy guide is missing.

Author

We will open-source SiMM by the end of February and update the documentation links in this PR accordingly.

"aibrix",
"dynamic",
"eic",
"simm",
Collaborator

How to install SiMM for testing?

Author

@hhu-scitix commented Feb 2, 2026

You can use docker/simm.Dockerfile to build an SGLang image with the SiMM package; the SiMM backend binaries can be fetched from https://oss-ap-southeast.scitix.ai/hisys-simm-depository/simm/v0.2.0/ .
The image is based on Ubuntu 24.04, and you will be able to build from source once we open-source SiMM.
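
For instance, a standard docker build against that Dockerfile (the image tag is arbitrary):

```bash
docker build -f docker/simm.Dockerfile -t sglang-simm:dev .
```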


Labels

diffusion, documentation, hicache, lora, Multi-modal, npu, run-ci
