Skip to content

add tips to force NCCL comm to go through EFA#531

Closed
KeitaW wants to merge 1112 commits intomainfrom
add-tips-to-nccl-test
Closed

add tips to force NCCL comm to go through EFA#531
KeitaW wants to merge 1112 commits intomainfrom
add-tips-to-nccl-test

Conversation

@KeitaW
Copy link
Collaborator

@KeitaW KeitaW commented Jan 23, 2025

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

perifaws and others added 30 commits April 15, 2024 12:03
Extend github actions bot to 60-90
…l reutrns outputs before proceeding with prometheus and slurm exporter installation

Signed-off-by: nghtm <nghtm@amazon.com>
- Removed unnecessary back-slash characters in array declarations, as they are not compatible with auto-resume.
…ssertion failure (#273)

* Use specific version of Megatron-LM, to avoid FP32 assertion failure

* Enable auto-resume on HyperPod

* Ignore training input files under gpt2/ directory.
pcluster-fetch-config: show error messages
This is not required we should be using setup_conda_env.sh
Update SMP to never version plus fixed issue with pytorch installation failing.
Fixed conda setup failure for SMP.
mhuguesaws and others added 17 commits December 6, 2024 15:47
* Remove ssh key

* Add security group creation to LDAP server
Signed-off-by: Sean Smith <seaam@amazon.com>
Signed-off-by: Sean Smith <seaam@amazon.com>
Signed-off-by: Sean Smith <seaam@amazon.com>
Signed-off-by: Sean Smith <seaam@amazon.com>
* Change aws ofi plugin version 1.13.1

* Change EFA Installer to be inline with aws ofi plugin

* Change aws ofi plugin version 1.13.2
Signed-off-by: Sean Smith <seaam@amazon.com>
* add an option to pull image

* change default sqsh image location

* update EKS nccl example to use prebuilt image by default

* fix docker filename in an unit test

* update to use the original TAG name in buildspec.yaml

* point to specific version

* make env vars spec consistent across orchastrators

* rebert

* Update nccl-tests-container.sbatch

* Update README.md

* Update micro-benchmarks/nccl-tests/README.md

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>

* update readme

* revert change in image value field for nccl-tests.yaml

* fix typo

---------

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>
Training plans logic added in to automate script.
Adding script to onboard EKS SMHP. Includes user creation, training plans, supports all deployment modes
* Update gen-keypair-ubuntu.sh to update authorized_keys

IHAC who accidentally remove `authorized_keys` after cluster setup. That caused
```
2024-12-26T09:09:03.929Z
Generate a new keypair...
2024-12-26T09:09:04.180Z
+ ssh-keygen -t rsa -q -f id_rsa -N ‘’
2024-12-26T09:09:04.180Z
id_rsa already exists.
2024-12-26T09:09:04.430Z
Overwrite (y/n)? Traceback (most recent call last): File “lifecycle_script.py”, line 232, in <module> main(args) File “lifecycle_script.py”, line 185, in main ExecuteBashScript(“./utils/gen-keypair-ubuntu.sh”).run() File “lifecycle_script.py”, line 31, in run result.check_returncode() File “/usr/lib/python3.8/subprocess.py”, line 448, in check_returncode raise CalledProcessError(self.returncode, self.args, self.stdout,
2024-12-26T09:09:08.979Z
subprocess.CalledProcessError: Command ’[’sudo’, ‘bash’, ‘./utils/gen-keypair-ubuntu.sh’]' returned non-zero
```
when they tried to add a new node as `GENERATE_KEYPAIR=1` when we don't have the contents of `id_rsa.pub` in `authorized_keys`, even when the key pair does exist.

* Update 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils/gen-keypair-ubuntu.sh

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>

---------

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>
Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>
@KeitaW KeitaW requested a review from mhuguesaws January 23, 2025 01:32
* `S` : number of elements being communicated (similar to count for Algbw and Busbw)
* `B` : theoretical peak bandwidth.

## 4. Tips and Tricks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please find another title.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other usages?


## 4. Tips and Tricks

This section demonstrates NCCL tests tips and tricks useful to diagnose cluster nodes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we know if this is bad or good?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the test works till the end -> good. Otherwise bad?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should set expected performance on single instance.

Copy link
Contributor

@mhuguesaws mhuguesaws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments.

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.