
Conversation


@Rythsman Rythsman commented Dec 12, 2025

Motivation

Modifications

Accuracy Tests

| Tasks | Version | Filter           | n-shot | Metric      | Value  | Stderr   |
|-------|---------|------------------|--------|-------------|--------|----------|
| gsm8k | 3       | flexible-extract | 5      | exact_match | 0.9545 | ± 0.0057 |
|       |         | strict-match     | 5      | exact_match | 0.9538 | ± 0.0058 |

Benchmarking and Profiling

tp16:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     1         
Benchmark duration (s):                  322.81    
Total input tokens:                      8192      
Total input text tokens:                 8192      
Total input vision tokens:               0         
Total generated tokens:                  20480     
Total generated tokens (retokenized):    20480     
Request throughput (req/s):              0.00      
Input token throughput (tok/s):          25.38     
Output token throughput (tok/s):         63.44     
Peak output token throughput (tok/s):    67.00     
Peak concurrent requests:                1         
Total token throughput (tok/s):          88.82     
Concurrency:                             1.00      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   322467.57 
Median E2E Latency (ms):                 322467.57 
---------------Time to First Token----------------
Mean TTFT (ms):                          342.87    
Median TTFT (ms):                        342.87    
P99 TTFT (ms):                           342.87    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          15.73     
Median TPOT (ms):                        15.73     
P99 TPOT (ms):                           15.73     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           15.73     
Median ITL (ms):                         15.71     
P95 ITL (ms):                            16.39     
P99 ITL (ms):                            16.50     
Max ITL (ms):                            21.18     
==================================================

dcp8tp16:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     1
Benchmark duration (s):                  332.06
Total input tokens:                      8192
Total input text tokens:                 8192
Total input vision tokens:               0
Total generated tokens:                  20480
Total generated tokens (retokenized):    20477
Request throughput (req/s):              0.00
Input token throughput (tok/s):          24.67
Output token throughput (tok/s):         61.68
Peak output token throughput (tok/s):    63.00
Peak concurrent requests:                1
Total token throughput (tok/s):          86.35
Concurrency:                             1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   331712.71
Median E2E Latency (ms):                 331712.71
---------------Time to First Token----------------
Mean TTFT (ms):                          327.40
Median TTFT (ms):                        327.40
P99 TTFT (ms):                           327.40
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          16.18
Median TPOT (ms):                        16.18
P99 TPOT (ms):                           16.18
---------------Inter-Token Latency----------------
Mean ITL (ms):                           16.18
Median ITL (ms):                         16.18
P95 ITL (ms):                            16.37
P99 ITL (ms):                            16.63
Max ITL (ms):                            19.34
==================================================

Note

  1. tp16 with DeepGEMM hangs if --enable-symm-mem is used.
  2. dcp8tp16 needs --enable-symm-mem added to server_args to enable this feature.

Checklist

Summary by Sourcery

Enable NCCL symmetric memory for DCP collectives and integrate it into distributed execution and environment configuration.

New Features:

  • Add opt-in support for using symmetric memory with DCP collective operations when --enable-symm-mem is set.

Enhancements:

  • Track per-group symmetric memory enablement in the parallel state so only the DCP group uses symmetric memory when DCP is enabled.
  • Route DCP reduce-scatter and intermediate tensor operations in DeepSeek v2 through the symmetric memory allocator to support NCCL symmetric memory kernels.
  • Include the DCP group in CUDA graph capture alongside existing tensor and pipeline parallel groups.
  • Simplify attention LSE all-gather logic in cp_lse_ag_out_rs by using the group all_gather result shape directly.
  • Configure NCCL_GRAPH_MIXING_SUPPORT=0 automatically for multi-rank DCP runs when symmetric memory is enabled to improve symmetric kernel performance.


sourcery-ai bot commented Dec 12, 2025

Reviewer's Guide

Adds NCCL symmetric memory support for DCP collective operations, wiring it through distributed groups, allocator policy, DeepSeek v2 attention paths, and environment configuration while tightening how collective buffers are allocated and captured.

Sequence diagram for DCP attention collectives with NCCL symmetric memory

sequenceDiagram
    actor User
    participant Engine as EngineEnvConfig
    participant Dist as GroupCoordinator_DCP
    participant Alloc as use_symmetric_memory
    participant Attn as DeepSeekV2Attention
    participant Utils as CPAttentionUtils

    User->>Engine: start_server(--enable-symm-mem, SGLANG_DCP>1)
    Engine->>Engine: set NCCL_NVLS_ENABLE
    Engine->>Engine: set NCCL_GRAPH_MIXING_SUPPORT=0 (DCP>1)

    Engine->>Dist: construct DCP group
    Dist->>Dist: read enable_symm_mem from server_args
    Dist->>Dist: read dcp_size from SGLANG_DCP
    Dist->>Dist: symm_mem_enabled_for_group = True (for DCP group)

    User->>Attn: run_forward()

    Attn->>Dist: get_dcp_group()
    Attn->>Alloc: use_symmetric_memory(Dist)
    Alloc->>Dist: check symm_mem_enabled_for_group and world_size>1
    Alloc-->>Attn: SymmetricMemoryContext
    Attn->>Attn: with SymmetricMemoryContext: torch.cat(q_pe, q_nope_out)

    Attn->>Dist: get_dcp_group().all_gather(combined)

    Attn->>Alloc: use_symmetric_memory(Dist)
    Alloc-->>Attn: SymmetricMemoryContext
    Attn->>Attn: with SymmetricMemoryContext: clone attn_output, lse

    Attn->>Utils: cp_lse_ag_out_rs(attn_output, lse, cp_group)
    Utils->>Dist: cp_group.all_gather(cp_attn_lse, dim=0)
    Utils->>Utils: correct_attn_out(...)
    Utils->>Dist: cp_group.reduce_scatter_along_dim(out, dim=1)
    Dist->>Alloc: use_symmetric_memory(Dist)
    Alloc-->>Dist: SymmetricMemoryContext
    Dist->>Dist: with SymmetricMemoryContext: allocate reduce_scatter output
    Dist->>Dist: reduce_scatter_tensor(output_tensor, input_tensor)
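
A condensed sketch of the sequence above. The helper names (use_symmetric_memory, get_dcp_group, cp_lse_ag_out_rs) come from the files this PR touches, but the wrapper function names, import paths, tensor shapes, and the concat dimension are illustrative assumptions rather than the actual forward_absorb_prepare/forward_absorb_core code.

```python
import torch

from sglang.srt.distributed.device_communicators.pynccl_allocator import (
    use_symmetric_memory,
)
from sglang.srt.distributed.parallel_state import get_dcp_group
from sglang.srt.layers.attention.utils import cp_lse_ag_out_rs


def prepare_queries_for_dcp(q_pe: torch.Tensor, q_nope_out: torch.Tensor):
    dcp_group = get_dcp_group()
    # Allocate the concatenated query tensor with the NCCL allocator so the
    # all_gather below can run on symmetric-memory kernels.
    with use_symmetric_memory(dcp_group):
        combined = torch.cat([q_pe, q_nope_out], dim=-1)  # concat dim is illustrative
    return dcp_group.all_gather(combined)


def combine_attention_across_dcp(attn_output: torch.Tensor, lse: torch.Tensor):
    dcp_group = get_dcp_group()
    # Clone into NCCL-allocated buffers before the LSE-corrected
    # all-gather / reduce-scatter inside cp_lse_ag_out_rs.
    with use_symmetric_memory(dcp_group):
        attn_output = attn_output.clone(memory_format=torch.contiguous_format)
        lse = lse.clone(memory_format=torch.contiguous_format)
    return cp_lse_ag_out_rs(attn_output, lse, dcp_group)
```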

Updated class diagram for distributed groups and symmetric memory policy

classDiagram
    class GroupCoordinator {
        +device
        +device_module
        +world_size
        +device_group
        +symm_mem_enabled_for_group : bool
        +graph_capture(stream)
        +reduce_scatter_along_dim(input_tensor, dim, op)
        +reduce_scatter_tensor(output_tensor, input_tensor)
        +all_gather(tensor, dim)
        +rank_in_group : int
    }

    class SymmetricMemoryContext {
        +SymmetricMemoryContext(group_coordinator)
        +__enter__()
        +__exit__()
    }

    class PyncclAllocatorHelpers {
        +use_symmetric_memory(group_coordinator, disabled : bool) SymmetricMemoryContext|nullcontext
    }

    class DeepSeekV2Attention {
        +forward_absorb_prepare(...)
        +forward_absorb_core(...)
    }

    class CPTritonContext {
        +CPTritonContext()
    }

    class CPAttentionUtils {
        +cp_lse_ag_out_rs(cp_attn_out, cp_attn_lse, cp_group, ctx)
    }

    class EngineEnvConfig {
        +_set_envs_and_config(server_args)
    }

    GroupCoordinator "1" --> "1" SymmetricMemoryContext : creates
    PyncclAllocatorHelpers --> SymmetricMemoryContext : returns
    PyncclAllocatorHelpers --> GroupCoordinator : uses

    DeepSeekV2Attention --> PyncclAllocatorHelpers : use_symmetric_memory(get_dcp_group)
    DeepSeekV2Attention --> GroupCoordinator : get_dcp_group

    CPAttentionUtils --> CPTritonContext : creates
    CPAttentionUtils --> GroupCoordinator : cp_group.all_gather()
    CPAttentionUtils --> GroupCoordinator : cp_group.reduce_scatter_along_dim()

    EngineEnvConfig --> GroupCoordinator : config via env vars
    EngineEnvConfig --> PyncclAllocatorHelpers : influences NCCL behavior

Flow diagram for symmetric memory enablement per group and allocation

flowchart TD
    A[Start group initialization] --> B[Read enable_symm_mem from server_args]
    B --> C[Read dcp_size from SGLANG_DCP]
    C --> D{dcp_size > 1?}
    D -- Yes --> E{group_name == dcp?}
    D -- No --> F[Allow symmetric memory for any group]
    E -- Yes --> G[enable_symm_mem and DCP group]
    E -- No --> H[Disable symmetric memory for this group]
    G --> I[Set symm_mem_enabled_for_group = True]
    F --> I
    H --> J[Set symm_mem_enabled_for_group = False]
    I --> K[GroupCoordinator constructed]
    J --> K

    subgraph Allocation_path
        L["Call use_symmetric_memory(group_coordinator, disabled)"] --> M{disabled is True?}
        M -- Yes --> N[Return nullcontext]
        M -- No --> O{group_coordinator.symm_mem_enabled_for_group?}
        O -- No --> N
        O -- Yes --> P{group_coordinator.world_size == 1?}
        P -- Yes --> N
        P -- No --> Q[Return SymmetricMemoryContext]
    end
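
A minimal Python sketch of the same gate, assuming the per-group flag and context class introduced by this PR; the stub below only mirrors the decision logic and is not the verbatim pynccl_allocator.py implementation.

```python
from contextlib import nullcontext


class SymmetricMemoryContext:
    """Stand-in for the real context manager, which routes tensor
    allocations through the NCCL window allocator for the given group."""

    def __init__(self, group_coordinator):
        self.group_coordinator = group_coordinator

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False


def use_symmetric_memory(group_coordinator, disabled: bool = False):
    # Skip symmetric memory when it is explicitly disabled, when this group was
    # not selected for it (e.g. only the DCP group when DCP > 1), or when the
    # group is trivial and no collective kernel would benefit.
    if (
        disabled
        or not getattr(group_coordinator, "symm_mem_enabled_for_group", False)
        or group_coordinator.world_size == 1
    ):
        return nullcontext()
    return SymmetricMemoryContext(group_coordinator)
```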

File-Level Changes

Change Details Files
Cache per-group symmetric memory enablement and use it for collective buffer allocation and graph capture in parallel_state.
  • Compute symm_mem_enabled_for_group in GroupCoordinator based on global server args, DCP size, and group name so only the DCP group uses symmetric memory when DCP>1.
  • Wrap reduce_scatter_along_dim output allocation in use_symmetric_memory to ensure buffers for reduce-scatter use the NCCL allocator when enabled.
  • Extend global graph_capture to also capture DCP group operations alongside TP and PP, sharing the same capture context.
python/sglang/srt/distributed/parallel_state.py
Tighten use_symmetric_memory to respect group-level flags and avoid symmetric allocation for non-participating or trivial groups.
  • Change disabled logic to check group_coordinator.symm_mem_enabled_for_group and world_size>1 instead of the global is_symmetric_memory_enabled flag.
  • Return a nullcontext when symmetric memory is disabled or the group is size 1, otherwise return SymmetricMemoryContext.
python/sglang/srt/distributed/device_communicators/pynccl_allocator.py
Ensure DeepSeek v2 DCP attention paths allocate and gather via NCCL symmetric memory when enabled.
  • Wrap q_pe/q_nope_out concatenation in forward_absorb_prepare with use_symmetric_memory(get_dcp_group()) so the combined tensor uses NCCL symmetric memory before all_gather.
  • In forward_absorb_core, when enable_symm_mem is on, clone attn_output and lse inside a use_symmetric_memory(get_dcp_group()) context to force NCCL allocator usage before cp_lse_ag_out_rs and subsequent DCP collectives.
  • Keep attention output reshaping and cp_lse_ag_out_rs call sequence unchanged except for the new allocator-aware clones.
python/sglang/srt/models/deepseek_v2.py
Simplify cp_lse_ag_out_rs to rely on group all_gather for allocation instead of manual empty allocation, aligning with symmetric memory usage.
  • Remove explicit torch.empty allocation for lses and pre-contiguity step for cp_attn_lse.
  • Use cp_group.all_gather(cp_attn_lse, dim=0) followed by a reshape to build lses, then run correct_attn_out and reduce_scatter_along_dim as before.
python/sglang/srt/layers/attention/utils.py
Adjust NCCL runtime environment when symmetric memory is enabled, particularly for DCP multi-rank configurations.
  • When enable_symm_mem is on (or NCCL_GRAPH_MIXING_SUPPORT is unset) and SGLANG_DCP>1, set NCCL_GRAPH_MIXING_SUPPORT=0 to improve performance of symmetric kernels.
  • Keep NCCL_NVLS_ENABLE tied to enable_nccl_nvls or enable_symm_mem and leave other CUDA env vars unchanged.
python/sglang/srt/entrypoints/engine.py
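
For reference, the NCCL-related portion of _set_envs_and_config in engine.py looks roughly like the sketch below. It is condensed from the diff excerpt quoted later in the review comments; outer guards and the unrelated CUDA environment variables are omitted.

```python
import os


def _set_envs_and_config(server_args):  # NCCL-related portion only
    # NVLS collectives are needed for symmetric-memory kernels, so either flag
    # turns NCCL_NVLS_ENABLE on.
    os.environ["NCCL_NVLS_ENABLE"] = str(
        int(server_args.enable_nccl_nvls or server_args.enable_symm_mem)
    )
    # NCCL_GRAPH_MIXING_SUPPORT=0 helps symmetric-kernel performance; see
    # https://github.com/NVIDIA/nccl-tests/issues/333#issuecomment-3103636985
    if "NCCL_GRAPH_MIXING_SUPPORT" not in os.environ or server_args.enable_symm_mem:
        dcp_size = int(os.getenv("SGLANG_DCP", "1") or "1")
        if dcp_size > 1:
            os.environ["NCCL_GRAPH_MIXING_SUPPORT"] = "0"
```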


@gemini-code-assist

Summary of Changes

Hello @Rythsman, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for NCCL symmetric memory in distributed collective operations, primarily targeting the decode context parallel (DCP) group. By conditionally enabling symmetric memory and tuning the related environment variables, the changes aim to improve the performance of distributed tensor operations within the SGLang framework, particularly memory allocation and data-transfer efficiency during model execution.

Highlights

  • Symmetric Memory Control: Introduced a new symm_mem_enabled_for_group flag within GroupCoordinator to precisely control when NCCL symmetric memory is used for collective operations, especially for the decode context parallel (DCP) group.
  • Performance Optimization: Configured the NCCL_GRAPH_MIXING_SUPPORT environment variable to "0" when symmetric memory is enabled and DCP is active, aiming to improve performance for symmetric kernels.
  • Integration into Collective Operations: Applied the symmetric memory context to tensor allocation in reduce_scatter_along_dim (sketched after this list) and to torch.cat and clone operations within the DeepSeekV2 model's attention mechanisms.
  • Graph Capture Extension: Extended the graph_capture mechanism to include the DCP group, ensuring proper synchronization and memory management during graph-based execution.
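
A rough sketch of that reduce-scatter allocation path: the output buffer is created under the symmetric-memory context so NCCL can pick symmetric kernels when the group allows it. The real code is GroupCoordinator.reduce_scatter_along_dim in parallel_state.py (which enters the context via self.use_symmetric_memory(self), as a review comment below notes); the free-function form, signature, and simplified dim handling here are assumptions, not the actual implementation.

```python
import torch

from sglang.srt.distributed.device_communicators.pynccl_allocator import (
    use_symmetric_memory,
)


def reduce_scatter_along_dim(group, input_tensor: torch.Tensor, dim: int = 0):
    # Each rank keeps 1/world_size of the scattered dimension.
    output_shape = list(input_tensor.shape)
    output_shape[dim] //= group.world_size
    # Allocate the output inside the symmetric-memory context so the
    # reduce-scatter can use NCCL symmetric kernels when enabled for this group.
    with use_symmetric_memory(group):
        output_tensor = torch.empty(
            output_shape, dtype=input_tensor.dtype, device=input_tensor.device
        )
    group.reduce_scatter_tensor(output_tensor, input_tensor)
    return output_tensor
```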


@sourcery-ai sourcery-ai bot left a comment


Hey there - I've reviewed your changes - here's some feedback:

  • In GroupCoordinator.reduce_scatter_along_dim, the context is entered with with self.use_symmetric_memory(self):; if use_symmetric_memory is already a bound method that wraps the allocator helper, this extra self argument is likely redundant or incorrect and could be simplified to match other call sites.
  • The symm_mem_enabled_for_group flag currently relies on group_name == "dcp" and SGLANG_DCP; consider centralizing this DCP-detection logic (and avoiding hard-coded group-name strings) so that future changes to group naming or DCP configuration don’t silently break symmetric-memory routing.
  • In deepseek_v2.forward_absorb_core, the clone() operations for attn_output and lse under enable_symm_mem are on a hot path; if possible, reuse buffers or ensure they are allocated with the NCCL allocator earlier to avoid extra allocations and copies for every forward pass.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `GroupCoordinator.reduce_scatter_along_dim`, the context is entered with `with self.use_symmetric_memory(self):`; if `use_symmetric_memory` is already a bound method that wraps the allocator helper, this extra `self` argument is likely redundant or incorrect and could be simplified to match other call sites.
- The `symm_mem_enabled_for_group` flag currently relies on `group_name == "dcp"` and `SGLANG_DCP`; consider centralizing this DCP-detection logic (and avoiding hard-coded group-name strings) so that future changes to group naming or DCP configuration don’t silently break symmetric-memory routing.
- In `deepseek_v2.forward_absorb_core`, the `clone()` operations for `attn_output` and `lse` under `enable_symm_mem` are on a hot path; if possible, reuse buffers or ensure they are allocated with the NCCL allocator earlier to avoid extra allocations and copies for every forward pass.

## Individual Comments

### Comment 1
<location> `python/sglang/srt/distributed/parallel_state.py:321-325` </location>
<code_context>
+        # - When enable_symm_mem is on and DCP is enabled (SGLANG_DCP > 1),
+        #   only the DCP group should use SymmetricMemoryContext.
+        # - When DCP is disabled, keep the original behavior.
+        try:
+            from sglang.srt.server_args import get_global_server_args
+
+            enable_symm_mem = bool(get_global_server_args().enable_symm_mem)
+        except Exception:
+            enable_symm_mem = False
+        try:
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Catching all Exceptions when reading `enable_symm_mem` may hide configuration errors.

Using `except Exception:` means any bug in `get_global_server_args()` (e.g., typos, import errors) will silently disable symmetric memory. Consider catching only the specific expected exceptions (such as `ImportError`/`AttributeError`) and/or logging when the fallback is used so configuration issues are detectable.

Suggested implementation:

```python
import logging
import os

logger = logging.getLogger(__name__)

```

```python
        # Cache symmetric memory policy for this group.
        # - When enable_symm_mem is on and DCP is enabled (SGLANG_DCP > 1),
        #   only the DCP group should use SymmetricMemoryContext.
        # - When DCP is disabled, keep the original behavior.
        try:
            from sglang.srt.server_args import get_global_server_args

            enable_symm_mem = bool(get_global_server_args().enable_symm_mem)
        except (ImportError, AttributeError) as exc:
            logger.warning(
                "Symmetric memory disabled due to configuration issue reading "
                "enable_symm_mem: %s",
                exc,
            )
            enable_symm_mem = False
        try:
            dcp_size = int(os.getenv("SGLANG_DCP", "1") or "1")
        except Exception:
            dcp_size = 1
        self.symm_mem_enabled_for_group = bool(

```

If `import os` is not present or is in a different location in `parallel_state.py`, adjust the first SEARCH/REPLACE block so that `import logging` and the `logger = logging.getLogger(__name__)` initialization are inserted alongside the existing top-level imports for this module.
</issue_to_address>

### Comment 2
<location> `python/sglang/srt/distributed/device_communicators/pynccl_allocator.py:186` </location>
<code_context>
-        or disabled
+    if (
+        disabled
+        or not getattr(group_coordinator, "symm_mem_enabled_for_group", False)
         or group_coordinator.world_size == 1
-    )
</code_context>

<issue_to_address>
**question (bug_risk):** Switching from a global `is_symmetric_memory_enabled()` check to a per-group attribute may change behavior for other coordinators.

This now bases `use_symmetric_memory` solely on `group_coordinator.symm_mem_enabled_for_group`, defaulting to `False` if absent. Callers that previously relied on the global `is_symmetric_memory_enabled()` but don’t set this attribute will lose symmetric memory with no warning. If that’s intended, please ensure all coordinator types initialize `symm_mem_enabled_for_group`; otherwise, consider OR-ing this attribute with the global flag to keep existing behavior.
</issue_to_address>

### Comment 3
<location> `python/sglang/srt/layers/attention/utils.py:365` </location>
<code_context>
-        (cp_group.world_size,) + cp_attn_lse.shape,
-        dtype=cp_attn_lse.dtype,
-        device=cp_attn_lse.device,
+    lses = cp_group.all_gather(cp_attn_lse, dim=0).view(
+        (cp_group.world_size,) + cp_attn_lse.shape
     )
</code_context>

<issue_to_address>
**suggestion (performance):** Removing the explicit `contiguous()` on `cp_attn_lse` may affect performance or assumptions in `all_gather`.

`cp_attn_lse` is now passed to `all_gather` without being made contiguous first, so non-contiguous inputs may incur extra internal copies in NCCL and change performance characteristics. If non-contiguous tensors are expected here, consider retaining `.contiguous()` (or enforcing contiguity at the call site) to avoid hidden perf regressions.
</issue_to_address>

### Comment 4
<location> `python/sglang/srt/entrypoints/engine.py:701-706` </location>
<code_context>
         os.environ["NCCL_NVLS_ENABLE"] = str(
             int(server_args.enable_nccl_nvls or server_args.enable_symm_mem)
         )
+    if "NCCL_GRAPH_MIXING_SUPPORT" not in os.environ or server_args.enable_symm_mem:
+        dcp_size = int(os.getenv("SGLANG_DCP", "1") or "1")
+        # Note(wh): NCCL_GRAPH_MIXING_SUPPORT=0 can help improve performance for symmetric kernels.
+        # details in https://github.com/NVIDIA/nccl-tests/issues/333#issuecomment-3103636985
+        if dcp_size > 1:
+            os.environ["NCCL_GRAPH_MIXING_SUPPORT"] = "0"
     os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "8"
     os.environ["CUDA_MODULE_LOADING"] = "AUTO"
</code_context>

<issue_to_address>
**question:** Overriding user-provided `NCCL_GRAPH_MIXING_SUPPORT` when `enable_symm_mem` is true might be surprising.

This logic will override an explicitly set `NCCL_GRAPH_MIXING_SUPPORT` whenever `enable_symm_mem` is true and `dcp_size > 1`. Since this is a performance tuning knob, consider only setting it when the env var is unset, or clearly documenting that `enable_symm_mem` intentionally takes precedence over user configuration.
</issue_to_address>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for using NCCL symmetric memory for DCP collective operations, which is a good performance enhancement. The changes are well-structured and correctly implement the feature. My review includes a few suggestions to improve code quality and maintainability, such as refining exception handling, removing code duplication, and eliminating a redundant operation. Overall, the implementation is solid.

Comment on lines +321 to +330
try:
    from sglang.srt.server_args import get_global_server_args

    enable_symm_mem = bool(get_global_server_args().enable_symm_mem)
except Exception:
    enable_symm_mem = False
try:
    dcp_size = int(os.getenv("SGLANG_DCP", "1") or "1")
except Exception:
    dcp_size = 1


medium

Using except Exception is too broad and can hide unexpected errors, making debugging more difficult. It's better to catch more specific exceptions. For get_global_server_args(), it raises a ValueError if not initialized. For int(), it raises a ValueError for invalid string conversions. Consider catching ValueError and other potential specific exceptions like ImportError or AttributeError.

        except (ImportError, AttributeError, ValueError):
            enable_symm_mem = False
        try:
            dcp_size = int(os.getenv("SGLANG_DCP", "1") or "1")
        except ValueError:
            dcp_size = 1

        int(server_args.enable_nccl_nvls or server_args.enable_symm_mem)
    )
    if "NCCL_GRAPH_MIXING_SUPPORT" not in os.environ or server_args.enable_symm_mem:
        dcp_size = int(os.getenv("SGLANG_DCP", "1") or "1")


medium

The logic to get dcp_size from the environment variable SGLANG_DCP is duplicated from python/sglang/srt/distributed/parallel_state.py. To improve maintainability and avoid code duplication, consider creating a utility function for this logic in a shared module.
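
One possible shape for that shared helper, in case it is useful; the name and location below are hypothetical and not part of this PR's diff.

```python
import os


def get_dcp_size_from_env(default: int = 1) -> int:
    """Read the DCP size from SGLANG_DCP, falling back to `default` on bad input."""
    try:
        return int(os.getenv("SGLANG_DCP", str(default)) or str(default))
    except ValueError:
        return default
```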

Comment on lines +2178 to 2186
if get_global_server_args().enable_symm_mem:
    # Note(wh): make sure input tensors use nccl allocator
    with use_symmetric_memory(get_dcp_group()):
        attn_output = attn_output.clone(
            memory_format=torch.contiguous_format
        )
        lse = lse.clone(memory_format=torch.contiguous_format)
attn_output = attn_output.contiguous()
attn_output = cp_lse_ag_out_rs(attn_output, lse, get_dcp_group())


medium

The call to attn_output.contiguous() on line 2185 is redundant when get_global_server_args().enable_symm_mem is true, because attn_output is already cloned with memory_format=torch.contiguous_format inside the if block. To avoid this redundancy and make the intent clearer, you can move the .contiguous() call into an else block.

Suggested change
-if get_global_server_args().enable_symm_mem:
-    # Note(wh): make sure input tensors use nccl allocator
-    with use_symmetric_memory(get_dcp_group()):
-        attn_output = attn_output.clone(
-            memory_format=torch.contiguous_format
-        )
-        lse = lse.clone(memory_format=torch.contiguous_format)
-attn_output = attn_output.contiguous()
-attn_output = cp_lse_ag_out_rs(attn_output, lse, get_dcp_group())
+if get_global_server_args().enable_symm_mem:
+    # Note(wh): make sure input tensors use nccl allocator
+    with use_symmetric_memory(get_dcp_group()):
+        attn_output = attn_output.clone(
+            memory_format=torch.contiguous_format
+        )
+        lse = lse.clone(memory_format=torch.contiguous_format)
+else:
+    attn_output = attn_output.contiguous()
+attn_output = cp_lse_ag_out_rs(attn_output, lse, get_dcp_group())

@staugust
Collaborator

| msg_size   |   [AllGather] torch eager time |   [AllGather] pynccl symm graph time |   [ReduceScatter] pynccl eager time |   [ReduceScatter] pynccl symm graph time |
|------------|--------------------------------|--------------------------------------|-------------------------------------|------------------------------------------|
| 2.0 KiB    |                       171.078  |                              2.7235  |                             27.1488 |                                  2.68881 |
| 4.0 KiB    |                        27.7056 |                              2.77451 |                             39.3312 |                                  2.92092 |
| 8.0 KiB    |                        23.4464 |                              3.02163 |                             22.8224 |                                  2.87033 |
| 16.0 KiB   |                        19.4848 |                              4.33491 |                             19.712  |                                  3.01983 |
| 32.0 KiB   |                        25.4816 |                              4.68513 |                             21.12   |                                  3.21582 |
| 64.0 KiB   |                        22.1152 |                              5.47508 |                             22.7968 |                                  3.27826 |

@Rythsman
Author


Looks about the same as what I get here. If enabling it end to end shows no benefit right now, capture a timeline and send it to me, and I'll take a look.

@staugust
Collaborator

@Rythsman I took a look at the test script. Is this comparison fair? symm_mem should only change how tensor GPU memory is allocated, but one of these two tests uses CUDA graph and the other doesn't, so there are too many variables in play. I'll run a comparison locally and take a look.

…n is incompatible with TP group symm-mem. Modifications will be made after the resolution of the multi-group symmetric memory coexistence issue.)

misc: remove unneed code after rebase

fix: fix ar coredump when dcp use symmetric memory

fea: add symm-mem unit perf test
@staugust staugust merged commit b5a4378 into antgroup:yjh/dcp-dev-main Jan 6, 2026
1 check passed