add tips to force NCCL comm to go through EFA #531
Closed
Extend github actions bot to 60-90
…l returns outputs before proceeding with prometheus and slurm exporter installation Signed-off-by: nghtm <nghtm@amazon.com>
Move nccl into micro-benchmarks
- Removed unnecessary back-slash characters in array declarations, as they are not compatible with auto-resume.
…ssertion failure (#273)
* Use specific version of Megatron-LM, to avoid FP32 assertion failure
* Enable auto-resume on HyperPod
* Ignore training input files under gpt2/ directory.
pcluster-fetch-config: show error messages
This is not required; we should be using setup_conda_env.sh
Update SMP to newer version plus fixed issue with pytorch installation failing.
Fixed conda setup failure for SMP.
* Remove ssh key
* Add security group creation to LDAP server
Signed-off-by: Sean Smith <seaam@amazon.com>
* Change aws ofi plugin version 1.13.1
* Change EFA Installer to be inline with aws ofi plugin
* Change aws ofi plugin version 1.13.2
* add an option to pull image
* change default sqsh image location
* update EKS nccl example to use prebuilt image by default
* fix docker filename in a unit test
* update to use the original TAG name in buildspec.yaml
* point to specific version
* make env vars spec consistent across orchestrators
* revert
* Update nccl-tests-container.sbatch
* Update README.md
* Update micro-benchmarks/nccl-tests/README.md

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>

* update readme
* revert change in image value field for nccl-tests.yaml
* fix typo

---------

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>
Training plans logic added in to automate script.
Adding script to onboard EKS SMHP. Includes user creation, training plans, supports all deployment modes
* Update gen-keypair-ubuntu.sh to update authorized_keys

IHAC who accidentally removed `authorized_keys` after cluster setup. That caused

```
2024-12-26T09:09:03.929Z Generate a new keypair...
2024-12-26T09:09:04.180Z + ssh-keygen -t rsa -q -f id_rsa -N ''
2024-12-26T09:09:04.180Z id_rsa already exists.
2024-12-26T09:09:04.430Z Overwrite (y/n)? Traceback (most recent call last):
  File "lifecycle_script.py", line 232, in <module>
    main(args)
  File "lifecycle_script.py", line 185, in main
    ExecuteBashScript("./utils/gen-keypair-ubuntu.sh").run()
  File "lifecycle_script.py", line 31, in run
    result.check_returncode()
  File "/usr/lib/python3.8/subprocess.py", line 448, in check_returncode
    raise CalledProcessError(self.returncode, self.args, self.stdout,
2024-12-26T09:09:08.979Z subprocess.CalledProcessError: Command '['sudo', 'bash', './utils/gen-keypair-ubuntu.sh']' returned non-zero
```

when they tried to add a new node as `GENERATE_KEYPAIR=1` when we don't have the contents of `id_rsa.pub` in `authorized_keys`, even when the key pair does exist.

* Update 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils/gen-keypair-ubuntu.sh

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>

---------

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>
Minor bug fix
Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>
mhuguesaws
reviewed
Jan 23, 2025
* `S` : number of elements being communicated (similar to count for Algbw and Busbw)
* `B` : theoretical peak bandwidth.

## 4. Tips and Tricks
Contributor
Please find another title.
mhuguesaws
reviewed
Jan 23, 2025
## 4. Tips and Tricks

This section demonstrates NCCL tests tips and tricks useful to diagnose cluster nodes.
Contributor
How do we know if this is bad or good?
Collaborator
Author
If the test runs to completion, it's good. Otherwise it's bad?
Contributor
We should state the expected performance on a single instance.
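One possible way to make the pass/fail criterion from this thread concrete is sketched below. It is only an illustration, not something taken from this PR: it assumes the test's stdout has been captured as text and contains the `# Avg bus bandwidth` summary line that nccl-tests binaries print at the end of a run. The function names, the expected value, and the 10% tolerance are hypothetical placeholders; the real expected figure would have to be measured per instance type.

```python
import re
from typing import Optional

# Summary line printed at the end of an nccl-tests run,
# e.g. "# Avg bus bandwidth    : 95.5" (value in GB/s).
_AVG_BUSBW = re.compile(r"#\s*Avg bus bandwidth\s*:\s*([0-9]+(?:\.[0-9]+)?)")


def parse_avg_busbw(output: str) -> Optional[float]:
    """Return the average bus bandwidth in GB/s, or None if the line is missing."""
    match = _AVG_BUSBW.search(output)
    return float(match.group(1)) if match else None


def check_busbw(output: str, expected_gbs: float, tolerance: float = 0.1) -> bool:
    """Pass only if the run completed AND reached the expected bandwidth.

    expected_gbs and tolerance are hypothetical: they stand in for whatever
    single-instance baseline the README ends up documenting.
    """
    measured = parse_avg_busbw(output)
    if measured is None:
        # No summary line: the test did not run to completion.
        return False
    return measured >= expected_gbs * (1.0 - tolerance)
```

With a check like this, a run that finishes but lands well below the documented baseline is flagged as bad, which covers both the "works till the end" criterion and the request for an expected single-instance number.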
Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>
Issue #, if available:
Description of changes:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.