Skip to content

Conversation

@vcherepanov-nv
Copy link
Collaborator

@vcherepanov-nv vcherepanov-nv commented Dec 16, 2025

Description

cuBLASMp 0.6 and later uses NCCL fallback for GEMM+AR if NVL / multicast is not available.
This change removes test skip logic.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Don't skip tests, even if multicast is not supported

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Vladimir Cherepanov <[email protected]>
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Dec 16, 2025

Greptile Overview

Greptile Summary

Removes the multicast support check from GemmAr test fixture. cuBLASMp 0.6+ automatically falls back to NCCL when multicast/NVL hardware is unavailable, making the test skip logic unnecessary.

Key changes:

  • Removed IsMulticastSupported() helper function
  • Removed SetUp() override in GemmAr that skipped tests on devices without multicast support
  • Tests will now run on all devices, relying on cuBLASMp's internal fallback mechanism

Confidence Score: 4/5

  • This PR is safe to merge with low risk - it simplifies test logic by removing unnecessary skip conditions
  • The change correctly removes test skip logic that is no longer needed with cuBLASMp 0.6+. However, there's a potential concern: the PR assumes all environments have cuBLASMp >= 0.6, but doesn't verify this at build/test time. If older cuBLASMp versions are still in use, tests could fail on hardware without multicast support
  • No files require special attention - the change is straightforward

Important Files Changed

File Analysis

Filename Score Overview
tests/cpp_distributed/test_comm_gemm.cu 4/5 Removed multicast support check for GemmAr tests, allowing tests to run on all devices by relying on cuBLASMp 0.6+ NCCL fallback

Sequence Diagram

sequenceDiagram
    participant Test as GemmAr Test
    participant GemmAr as GemmAr Fixture
    participant NVTE as nvte_gemm_all_reduce
    participant cuBLASMp as cuBLASMp Library
    participant HW as Hardware/Device
    
    Note over Test,HW: Before this PR (cuBLASMp < 0.6)
    Test->>GemmAr: SetUp()
    GemmAr->>HW: Check multicast support
    alt Multicast not supported
        GemmAr->>Test: GTEST_SKIP()
        Note over Test: Test skipped
    else Multicast supported
        GemmAr->>Test: Continue
        Test->>NVTE: Run gemm_all_reduce
        NVTE->>cuBLASMp: Execute GEMM+AR with multicast
    end
    
    Note over Test,HW: After this PR (cuBLASMp >= 0.6)
    Test->>GemmAr: SetUp() [no override]
    GemmAr->>Test: Continue (no check)
    Test->>NVTE: Run gemm_all_reduce
    NVTE->>cuBLASMp: Execute GEMM+AR
    alt Hardware has multicast/NVL
        cuBLASMp->>HW: Use multicast
    else No multicast support
        cuBLASMp->>cuBLASMp: Fallback to NCCL
        cuBLASMp->>HW: Use NCCL allreduce
    end
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional Comments (1)

  1. tests/cpp_distributed/test_comm_gemm.cu, line 373 (link)

    style: if CUBLASMP_VERSION is not defined by the cuBLASMp headers, this will evaluate to #if 0 < 600 (true), causing the old behavior. Consider adding a check like:

    #if !defined(CUBLASMP_VERSION) || CUBLASMP_VERSION < 600

    to make the intent explicit and avoid silent failures if the macro isn't defined.

1 file reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Signed-off-by: Vladimir Cherepanov <[email protected]>
@vcherepanov-nv vcherepanov-nv changed the title Use GEMM-AR fallback on newer cuBLASMp Remove test skip logic for GEMM-AR tests Dec 16, 2025
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional Comments (1)

  1. tests/cpp_distributed/test_comm_gemm.cu, line 336-366 (link)

    style: Consider verifying cuBLASMp >= 0.6 at build time. If older versions are still supported, tests will fail on hardware without multicast.

1 file reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Copy link
Collaborator

@timmoon10 timmoon10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@vcherepanov-nv vcherepanov-nv merged commit eac8af6 into NVIDIA:main Dec 16, 2025
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants