Skip to content

Conversation

harishvs
Copy link
Contributor

@harishvs harishvs commented Sep 7, 2025

Issue #, if available:

Description of changes:
Added support for topographical ordering of hostnames in mpi run
and made the slurm sbatch scripts more generic and added convenience scripts to convert nccl output to excel

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@harishvs
Copy link
Contributor Author

harishvs commented Sep 7, 2025

@amanshanbhag @nghtm

@harishvs harishvs marked this pull request as ready for review September 8, 2025 00:13
Copy link
Contributor

@KeitaW KeitaW left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. The tool looks useful, but shouldn't we create a separate directory for it instead of overwriting the existing Slurm example? Few workshops depend on the test case (e.g., https://catalog.workshops.aws/ml-on-aws-parallelcluster/en-US, https://catalog.workshops.aws/sagemaker-hyperpod/en-US). Also, we would like to show how to run the equivalent tests on both Slurm/Kubernetes using the current examples.
Left some initial comments.

@KeitaW KeitaW requested review from amanshanbhag and nghtm September 8, 2025 00:53
@harishvs harishvs force-pushed the main branch 2 times, most recently from f667e7f to f6e1716 Compare September 9, 2025 05:57
@harishvs
Copy link
Contributor Author

harishvs commented Sep 9, 2025

@KeitaW thanks for the detailed review. I moved my files to a v2 directory under slurm to preserve the the previous versions. I have addressed all your comments. please take a look.

@harishvs harishvs requested a review from KeitaW September 9, 2025 05:58
@nghtm
Copy link
Contributor

nghtm commented Sep 9, 2025

confirming @KeitaW and @amanshanbhag assigned for review

Copy link
Contributor

@KeitaW KeitaW left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating them. Few more comments. Could you please name the subdirectory in a bit more descriptive name instead of v2 (thanks for moving the scripts into the directory)? Maybe topology-aware-nccl-tests?

@harishvs harishvs force-pushed the main branch 3 times, most recently from 3e9ec98 to 2b7d45a Compare September 12, 2025 01:38
@harishvs harishvs force-pushed the main branch 3 times, most recently from 5d30390 to 0e65a1d Compare September 12, 2025 02:01
Comment on lines +57 to +66
docker build -f nccl-tests.Dockerfile \
--build-arg="EFA_INSTALLER_VERSION=${EFA_INSTALLER_VERSION}" \
--build-arg="AWS_OFI_NCCL_VERSION=${AWS_OFI_NCCL_VERSION}" \
--build-arg="NCCL_VERSION=${NCCL_VERSION}" \
--build-arg="NCCL_TESTS_VERSION=${NCCL_TESTS_VERSION}" \
-t ${CONTAINER_IMAGE_NAME_TAG} \
.
Copy link
Contributor

@KeitaW KeitaW Sep 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
docker build -f nccl-tests.Dockerfile \
--build-arg="EFA_INSTALLER_VERSION=${EFA_INSTALLER_VERSION}" \
--build-arg="AWS_OFI_NCCL_VERSION=${AWS_OFI_NCCL_VERSION}" \
--build-arg="NCCL_VERSION=${NCCL_VERSION}" \
--build-arg="NCCL_TESTS_VERSION=${NCCL_TESTS_VERSION}" \
-t ${CONTAINER_IMAGE_NAME_TAG} \
.
cd ../..
docker build -f nccl-tests.Dockerfile \
--build-arg="EFA_INSTALLER_VERSION=${EFA_INSTALLER_VERSION}" \
--build-arg="AWS_OFI_NCCL_VERSION=${AWS_OFI_NCCL_VERSION}" \
--build-arg="NCCL_VERSION=${NCCL_VERSION}" \
--build-arg="NCCL_TESTS_VERSION=${NCCL_TESTS_VERSION}" \
-t ${CONTAINER_IMAGE_NAME_TAG} \
.
cd -

Copy link
Contributor Author

@harishvs harishvs Sep 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the terminal, we have not asked the user to navigate to awsome-distributed-training/micro-benchmarks/nccl-tests/slurm/topology-aware-nccl-tests prior to this step, so we assume that the user is at the project root dir in the terminal and direct accordingly.

so i added
#Navigate to the slurm directory:
cd micro-benchmarks/nccl-tests/slurm/

docker build -f nccl-tests.Dockerfile
--build-arg="EFA_INSTALLER_VERSION=${EFA_INSTALLER_VERSION}"
--build-arg="AWS_OFI_NCCL_VERSION=${AWS_OFI_NCCL_VERSION}"
--build-arg="NCCL_VERSION=${NCCL_VERSION}"
--build-arg="NCCL_TESTS_VERSION=${NCCL_TESTS_VERSION}"
-t ${CONTAINER_IMAGE_NAME_TAG}
.


I also updated line 106 accordingly. 

Navigate to the topology-aware-nccl-tests directory:
```bash
cd topology-aware-nccl-tests

…e the slurm sbatch scripts more generic and added convenience

scripts to convert nccl output to excel
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants