
[Feature] Add SiMM as sglang HiCache Storage backend #18016

Open

hhu-scitix wants to merge 14 commits into sgl-project:main from scitix:feature/simm
Conversation

@hhu-scitix commented Jan 31, 2026

Description

SiMM (Scitix In-Memory Middleware) is a distributed, high-performance, elastic cache-acceleration layer for AI workloads.

Feature

  1. Support the 'page_first' and 'page_first_direct' zero-copy cache layouts.
  2. Support NUMA-aware RDMA NIC selection.

Benchmarking and Profiling

Tested DeepSeek R1 on 8x H200 141GB GPUs with an 8x400Gbps RoCE network (MT4129).

GPU Driver: 570.86.15, CUDA version: 12.9

Using benchmark/hicache/bench_multiturn.py.

client:

python3 bench_multiturn.py \
  --num-clients 128 --max-parallel * \
  --request-length 8000 --output-length 200 \
  --host 127.0.0.1 --port 8080 --disable-auto-run \
  --disable-random-sample \
  --model-path /models/preset/deepseek-ai/DeepSeek-R1/v1.0 \
  --num-rounds=* --seed=* \
  --ready-queue-policy=fifo \
  --sub-question-input-length=128

server:

python3 -m sglang.launch_server \
  --model /models/preset/deepseek-ai/DeepSeek-R1/v1.0 \
  --trust-remote-code \
  --tp 8 --mem-fraction-static 0.75 \
  --page-size 64 \
  --enable-hierarchical-cache \
  --hicache-ratio 1.1 \
  --hicache-mem-layout page_first_direct \
  --hicache-io-backend direct \
  --hicache-storage-backend simm \
  --hicache-write-policy write_through \
  --hicache-storage-prefetch-policy timeout \
  --hicache-storage-backend-extra-config '{"manager_address":"0.0.0.0:30001"}' \
  --port 8080

| rounds | parallel | Req throughput (req/s), SiMM | Req throughput (req/s), GPU | Input throughput (token/s), SiMM | Input throughput (token/s), GPU | Output throughput (token/s), SiMM | Output throughput (token/s), GPU | SLO TTFT (ms), SiMM | SLO E2E latency (ms), SiMM | SLO TTFT (ms), GPU | SLO E2E latency (ms), GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 4 | 0.81 | 0.80 | 6856.36 | 6810.02 | 161.14 | 159.83 | 0.43 | 3.20 | 0.50 | 3.45 |
| 3 | 8 | 0.97 | 0.90 | 8250.51 | 7670.94 | 193.62 | 180.09 | 0.47 | 3.84 | 0.56 | 4.05 |
| 3 | 16 | 0.97 | 1.02 | 8236.79 | 8653.88 | 193.49 | 203.30 | 0.49 | 3.73 | 0.63 | 4.92 |
| 10 | 4 | 0.88 | 0.80 | 8153.91 | 7767.07 | 168.62 | 160.61 | 0.41 | 3.11 | 0.56 | 3.57 |
| 10 | 8 | 0.97 | 0.91 | 9353.14 | 8819.79 | 193.55 | 182.40 | 0.43 | 3.65 | 0.66 | 4.51 |
| 10 | 16 | 1.01 | 0.98 | 9733.29 | 9459.86 | 201.32 | 195.47 | 0.50 | 4.08 | 0.72 | 5.23 |

'GPU' means only the GPU cache is used (HiCache disabled).

Checklist

hhu-scitix and others added 9 commits January 30, 2026 15:24
@github-actions bot added the Multi-modal, hicache, npu, and diffusion labels on Jan 31, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @hhu-scitix, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands SGLang's capabilities by integrating SiMM as a distributed, high-performance KV cache backend, which is crucial for scaling AI workloads. It also introduces support for the ExaoneMoE model, enhances the NPU attention backend for better cross-attention handling, and refines multimodal processing and speculative decoding with new safety checks and grammar integration. These changes collectively aim to improve performance, scalability, and model compatibility within the SGLang framework.

Highlights

  • SiMM HiCache Storage Backend Integration: Introduced SiMM (Scitix In-Memory Middleware) as a new, high-performance, distributed L3 KV cache storage backend for SGLang's HiCache. This includes support for 'page_first' and 'page_first_direct' memory layouts for zero-copy cache, and NUMA-aware RDMA NIC selection for optimized data transfer. Configuration can be done via extra arguments, JSON files, or environment variables.
  • NPU/Ascend Attention Backend Enhancements: Improved the NPU (Ascend) attention backend by introducing AscendTorchNativeAttnBackend to handle native SDPA operations. This update specifically addresses accuracy issues in cross-attention scenarios and ensures proper KV cache handling for both extend and decode phases, especially when encoder_lens are present.
  • LoRA Scale Parameter Addition: Added a lora_scale parameter to ServerArgs and its corresponding CLI argument. This allows users to specify a scaling factor for merging LoRA adapters, similar to the lora_scale in Diffusers, enhancing flexibility for LoRA usage.
  • ExaoneMoE Model Support: Added comprehensive support for the ExaoneMoEForCausalLM model, including its MLP, Attention, and DecoderLayer components. This also includes a specialized ExaoneMoEForCausalLMMTP for speculative decoding, and updates to model configuration and server arguments to properly handle this architecture.
  • Multimodal Padding Safety and Distributed Attention Refinements: Implemented a sanity check for multimodal pad values to prevent overlap with valid token IDs, ensuring robustness in multimodal models. Additionally, distributed attention mechanisms in Qwen3-VL and general multimodal utilities were refined to consistently use get_attention_tp_size/rank and get_attention_tp_group().all_gather for improved data parallelism.
  • Grammar-Aware Speculative Decoding: Enhanced speculative decoding by integrating grammar objects into the Ngram verification process. This allows for the generation of token bitmasks based on grammar rules, enabling more accurate and constrained output generation during speculative decoding.



@gemini-code-assist bot left a comment


Code Review

This pull request introduces SiMM as a new storage backend for HiCache, a significant feature that enhances distributed caching capabilities. It also adds support for the Exaone-MoE model family, including its multi-token prediction variant. Additionally, the changes include several valuable improvements and bug fixes, such as correcting padding logic for mrope_positions and safely initializing bias tensors with zeros. The overall code quality is high, and the new features are well-integrated.

Comment on lines 862 to 871
    if self.mrope_positions is not None:
        self.mrope_positions = self._pad_tensor_to_size(self.mrope_positions, bs)
        self.mrope_positions = torch.cat(
            [
                self.mrope_positions,
                self.mrope_positions.new_zeros(
                    3, num_tokens - self.mrope_positions.shape[1]
                ),
            ],
            dim=1,
        )
Contributor

high

The previous padding logic for mrope_positions appeared to be incorrect as it was padding along the wrong dimension with the batch size. This change correctly pads the sequence length dimension to num_tokens, which fixes the issue. Good catch!

Comment on lines 356 to 358
     self.bias = Parameter(
-        torch.empty(self.output_size_per_partition, dtype=params_dtype)
+        torch.zeros(self.output_size_per_partition, dtype=params_dtype)
     )
Contributor

medium

Initializing the bias tensor with torch.zeros instead of torch.empty is a great improvement for robustness. This prevents potential non-deterministic behavior from uninitialized memory if the bias is not loaded from a checkpoint.

@xiezhq-hermann xiezhq-hermann self-assigned this Jan 31, 2026
MOONCAKE_CHECK_SERVER = EnvBool(False)
MOONCAKE_STANDALONE_STORAGE = EnvBool(False)

# SiMM
Collaborator

Can we move all the environment variables into --hicache-storage-backend-extra-config, as has been done for the other backends?

Author

We already support --hicache-storage-backend-extra-config and have removed all the environment variables.

@xiezhq-hermann
Collaborator

Clean PR thanks : )

    except ImportError as e:
        raise ImportError(
            "Please install simm by following the instructions at "
            "to run vLLM with SimmConnector."
Collaborator

  1. Maybe you need to change vLLM to SGLang in the log message.
  2. I don't see a link after "following the instructions at" in the log message; could you add it please? (A possible rewording is sketched below.)
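
A possible rewording along those lines; the install-guide URL is a placeholder until SiMM is open-sourced, and the package name `simm` is assumed from the quoted snippet:

```python
try:
    import simm  # assumed package name, as in the quoted snippet
except ImportError as e:
    raise ImportError(
        "Please install SiMM by following the instructions at "
        "<SiMM install guide URL> to run SGLang with the SiMM HiCache backend."
    ) from e
```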

    @staticmethod
    def from_file() -> "SiMMConfig":
        """Load the config from a JSON file."""
        if os.environ.get(SGLANG_HICACHE_SIMM_CONFIG_PATH_ENV_VAR) is None:
Collaborator

Could you please move all the envs in here? (One possible shape is sketched below.)
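
A minimal sketch of that direction; the field name, env-var name, and precedence order below are illustrative, not the actual SiMMConfig API:

```python
import json
import os
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class SiMMConfigSketch:
    manager_address: str = "0.0.0.0:30001"

    @classmethod
    def load(cls, extra_config: Optional[Dict[str, Any]] = None) -> "SiMMConfigSketch":
        # Illustrative resolution order:
        # --hicache-storage-backend-extra-config > JSON config file > defaults.
        if extra_config:
            return cls(**extra_config)
        config_path = os.environ.get("SGLANG_HICACHE_SIMM_CONFIG_PATH")  # assumed env var name
        if config_path:
            with open(config_path) as f:
                return cls(**json.load(f))
        return cls()
```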

from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, List, Optional
Collaborator

Please wrap it in TYPE_CHECKING to reduce the time spent importing sglang at runtime (see the sketch below).
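
A minimal sketch of the TYPE_CHECKING pattern being suggested; the imported name and function signature are illustrative:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Resolved only by type checkers; skipped at runtime, so importing this
    # module does not pull in the rest of sglang.
    from sglang.srt.server_args import ServerArgs


def init_backend(server_args: "ServerArgs") -> None:
    # The string annotation keeps the type hint without a runtime import.
    ...
```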

Comment on lines 16 to 19
    keys = []
    for _ in range(kv_num):
        key = "test_" + str(uuid.uuid4())
        keys.append(key)
Collaborator

Please use a list comprehension instead of the for-loop (see below).
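
For reference, the quoted loop as a list comprehension (the kv_num value is illustrative):

```python
import uuid

kv_num = 8  # illustrative value
keys = [f"test_{uuid.uuid4()}" for _ in range(kv_num)]
```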

    try:
        self.mr_ext = register_mr(buffer)
        if self.mr_ext is None:
            logger.error(f"Failed to register buffer")
Collaborator

Could you please add more information so developers know how to debug or resolve this error? One possible shape is sketched below.
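
A sketch only, mirroring the quoted fragment and assuming RDMA memory registration is what can fail here; the specific hints are illustrative:

```python
try:
    self.mr_ext = register_mr(buffer)
    if self.mr_ext is None:
        logger.error(
            "Failed to register RDMA memory region for the host buffer. "
            "Check that an RDMA-capable NIC is visible, that locked-memory "
            "limits (ulimit -l) allow pinning the buffer, and that the SiMM "
            "data server is running and reachable."
        )
```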

    for key_ in keys:
        key_list.append(f"{key_}_{self.mha_suffix}_k")
        key_list.append(f"{key_}_{self.mha_suffix}_v")
    assert len(key_list) == len(ptr_list)
Collaborator

I hope you can add messages or log info to these assert statements; an example is sketched below.
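
For example, assuming the k/v pairing shown in the quoted loop, the assert could carry the mismatch details:

```python
assert len(key_list) == len(ptr_list), (
    f"key/ptr count mismatch: {len(key_list)} keys vs {len(ptr_list)} pointers "
    f"(expected 2 keys per page: one '_k' and one '_v')"
)
```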


Before launching the `SGLang server` with SiMM, you should launch the SiMM `cluster manager service` and `data server service`.

You can visit the [SiMM official deploy guide]() and deploy SiMM on your K8s cluster with an RDMA network.
Collaborator

The link to the SiMM official deploy guide is missing.

Author

We will open-source SiMM by the end of February and update the documentation links in this PR accordingly.

"aibrix",
"dynamic",
"eic",
"simm",
Collaborator

How to install SiMM for testing?

Author

@hhu-scitix commented Feb 2, 2026

You can use docker/simm.Dockerfile to build an SGLang image with the SiMM package; the SiMM backend binaries can be fetched from https://oss-ap-southeast.scitix.ai/hisys-simm-depository/simm/v0.2.0/ .
The image is based on Ubuntu 24.04, and you will be able to build from source once we open-source SiMM.
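
For instance, a standard docker build against that Dockerfile (the image tag is arbitrary):

```bash
docker build -f docker/simm.Dockerfile -t sglang-simm:dev .
```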


Labels

diffusion, documentation, hicache, lora, Multi-modal, npu, run-ci
