Added support for topographical ordering of hostnames in mpi run #846
Thanks for the PR. The tool looks useful, but shouldn't we create a separate directory for it instead of overwriting the existing Slurm example? A few workshops depend on this test case (e.g., https://catalog.workshops.aws/ml-on-aws-parallelcluster/en-US, https://catalog.workshops.aws/sagemaker-hyperpod/en-US). Also, we would like to keep showing how to run the equivalent tests on both Slurm and Kubernetes using the current examples.
Left some initial comments.
f667e7f to f6e1716
@KeitaW thanks for the detailed review. I moved my files to a v2 directory under slurm to preserve the previous versions, and I have addressed all your comments. Please take a look.
Confirming that @KeitaW and @amanshanbhag are assigned for review.
Thanks for updating them, and for moving the scripts into the directory. A few more comments. Could you please give the subdirectory a more descriptive name than v2? Maybe topology-aware-nccl-tests?
micro-benchmarks/nccl-tests/slurm/v2/nccl-tests-container.sbatch
3e9ec98 to 2b7d45a
5d30390 to 0e65a1d
micro-benchmarks/nccl-tests/slurm/topology-aware-nccl-tests/README.md
docker build -f nccl-tests.Dockerfile \
  --build-arg="EFA_INSTALLER_VERSION=${EFA_INSTALLER_VERSION}" \
  --build-arg="AWS_OFI_NCCL_VERSION=${AWS_OFI_NCCL_VERSION}" \
  --build-arg="NCCL_VERSION=${NCCL_VERSION}" \
  --build-arg="NCCL_TESTS_VERSION=${NCCL_TESTS_VERSION}" \
  -t ${CONTAINER_IMAGE_NAME_TAG} \
  .
Suggested change:
cd ../..
docker build -f nccl-tests.Dockerfile \
  --build-arg="EFA_INSTALLER_VERSION=${EFA_INSTALLER_VERSION}" \
  --build-arg="AWS_OFI_NCCL_VERSION=${AWS_OFI_NCCL_VERSION}" \
  --build-arg="NCCL_VERSION=${NCCL_VERSION}" \
  --build-arg="NCCL_TESTS_VERSION=${NCCL_TESTS_VERSION}" \
  -t ${CONTAINER_IMAGE_NAME_TAG} \
  .
cd -
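For comparison, the same "change directory, build, come back" pattern can be written with a subshell, so the caller's working directory is never touched even if the build fails. This is only a sketch: `run_in_dir` is a hypothetical helper name, not something in the PR or the repository, and the commented example assumes the build variables from the README are already exported.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): run a command from another
# directory without changing the caller's working directory.
run_in_dir() {
  local dir="$1"
  shift
  # The parentheses start a subshell, so the effect of cd is discarded
  # as soon as the command finishes, whether it succeeds or fails.
  ( cd "$dir" && "$@" )
}

# Example usage (commented out because it needs Docker and the
# variables defined earlier in the README):
# run_in_dir ../.. docker build -f nccl-tests.Dockerfile \
#   --build-arg="EFA_INSTALLER_VERSION=${EFA_INSTALLER_VERSION}" \
#   --build-arg="AWS_OFI_NCCL_VERSION=${AWS_OFI_NCCL_VERSION}" \
#   --build-arg="NCCL_VERSION=${NCCL_VERSION}" \
#   --build-arg="NCCL_TESTS_VERSION=${NCCL_TESTS_VERSION}" \
#   -t "${CONTAINER_IMAGE_NAME_TAG}" \
#   .
```

Compared with `cd ../..` followed by `cd -`, the subshell form cannot leave the reader in the wrong directory when a step in between exits early.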
The README has not asked the user to navigate to awsome-distributed-training/micro-benchmarks/nccl-tests/slurm/topology-aware-nccl-tests before this step, so we assume the user's terminal is at the project root directory and give directions from there.
So I added:
# Navigate to the slurm directory:
cd micro-benchmarks/nccl-tests/slurm/
docker build -f nccl-tests.Dockerfile \
  --build-arg="EFA_INSTALLER_VERSION=${EFA_INSTALLER_VERSION}" \
  --build-arg="AWS_OFI_NCCL_VERSION=${AWS_OFI_NCCL_VERSION}" \
  --build-arg="NCCL_VERSION=${NCCL_VERSION}" \
  --build-arg="NCCL_TESTS_VERSION=${NCCL_TESTS_VERSION}" \
  -t ${CONTAINER_IMAGE_NAME_TAG} \
  .
I also updated line 106 accordingly.
Navigate to the topology-aware-nccl-tests directory:
```bash
cd topology-aware-nccl-tests
```
micro-benchmarks/nccl-tests/slurm/topology-aware-nccl-tests/README.md
Issue #, if available:
Description of changes:
Added support for topographical ordering of hostnames in mpi run, made the Slurm sbatch scripts more generic, and added convenience scripts to convert NCCL output to Excel.
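To illustrate what ordering hostnames by topology can mean in practice, here is a minimal sketch. It is not the PR's implementation. It assumes a hypothetical input format of one line per host, `hostname layer1 layer2 layer3`, where the three layer IDs name the network nodes above that host from the top of the fabric down. The function name `order_hosts_by_topology` and the sample IDs are made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical sketch (not the PR's script): order hosts so that those
# sharing the same network nodes end up next to each other.
# Assumed input on stdin, one host per line, whitespace separated:
#   <hostname> <layer1-id> <layer2-id> <layer3-id>
order_hosts_by_topology() {
  # Sort by each network layer in turn, then by hostname to keep the
  # result deterministic. LC_ALL=C makes the comparison byte-wise so
  # the order does not depend on the user's locale.
  LC_ALL=C sort -k2,2 -k3,3 -k4,4 -k1,1 | awk '{ print $1 }'
}

# Example usage with made-up hostnames and layer IDs:
# printf '%s\n' \
#   'node-a nn-1 nn-10 nn-100' \
#   'node-b nn-1 nn-20 nn-200' \
#   'node-c nn-1 nn-10 nn-100' \
#   'node-d nn-1 nn-20 nn-201' | order_hosts_by_topology
```

With the sample input above, node-a and node-c share all three layers and are listed first, followed by node-b and node-d, which share only the first two. The resulting list could then be written to an MPI hostfile so that neighbouring ranks land on neighbouring hosts.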
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.