[WIP] change to cross nic=2 to allow for alternating ring algo and nccl==2.23.4 #48

Draft
wants to merge 2 commits into
base: master

@OrenLeung OrenLeung commented Feb 26, 2025

  • change to NCCL_CROSS_NIC=2
  • update from the very old nccl==2.19.4 in NGC 24.01 to nccl==2.23.4 in NGC 24.12
  • change to QPS_PER_CONNECTION=1 within the same rail group, since there are no hash collisions within a rail group
  • TODO: add a note about needing more QPs when above 1 tier of switching, to increase entropy
  • remove the NCCL topo file, since NCCL graph search should be able to auto-generate the topology on OCI's bare metal instances
  • remove NCCL_NET_PLUGIN=none
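
For reviewers, the environment changes listed above can be sketched roughly as follows. This is only an illustration, not the actual diff: it assumes `QPS_PER_CONNECTION` refers to NCCL's `NCCL_IB_QPS_PER_CONNECTION`, that the topo file was being passed via `NCCL_TOPO_FILE`, and the helper `build_nccl_env` and the multi-tier QP count of 4 are hypothetical placeholders (the real value is still a TODO).

```python
import os

# Values taken from the PR description.
CROSS_NIC = "2"
SAME_RAIL_QPS = "1"
# Hypothetical placeholder: the PR leaves the multi-tier QP count as a TODO.
MULTI_TIER_QPS = "4"

# Variables this PR stops setting, so NCCL falls back to its defaults.
# Assumption: the topo file was previously supplied through NCCL_TOPO_FILE.
REMOVED_VARS = ("NCCL_TOPO_FILE", "NCCL_NET_PLUGIN")


def build_nccl_env(base_env, same_rail_group=True):
    """Return a copy of base_env with the settings from this PR applied."""
    env = dict(base_env)
    env["NCCL_CROSS_NIC"] = CROSS_NIC
    # Assumption: QPS_PER_CONNECTION maps to NCCL_IB_QPS_PER_CONNECTION.
    env["NCCL_IB_QPS_PER_CONNECTION"] = (
        SAME_RAIL_QPS if same_rail_group else MULTI_TIER_QPS
    )
    for name in REMOVED_VARS:
        env.pop(name, None)
    return env


if __name__ == "__main__":
    # Print the NCCL-related settings that a launched job would inherit.
    env = build_nccl_env(os.environ)
    for name in sorted(k for k in env if k.startswith("NCCL_")):
        print(f"{name}={env[name]}")
```

The resulting dict could be passed as `env=` to `subprocess.run` when launching a test job; the actual scripts in this repo may set these variables differently.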

According to Sylvain's GTC24 NCCL talk, max BW is 390 GByte/s with alternating rings and 370 GByte/s without.


Thank you for your pull request and welcome to our community! To contribute, please sign the Oracle Contributor Agreement (OCA).
The following contributors of this PR have not signed the OCA:

To sign the OCA, please create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application.

When signing the OCA, please provide your GitHub username. After signing the OCA and getting an OCA approval from Oracle, this PR will be automatically updated.

If you are an Oracle employee, please make sure that you are a member of the main Oracle GitHub organization, and your membership in this organization is public.

@oracle-contributor-agreement bot added the "OCA Required" label (At least one contributor does not have an approved Oracle Contributor Agreement.) on Feb 26, 2025