Releases: awslabs/awsome-distributed-ai
v2.0.0-pre-reorg — pre-reorganization snapshot
Snapshot of main immediately before the repository reorganization (#1056).
This release preserves the legacy numbered directory layout (0.docs/, 1.architectures/, 2.ami_and_containers/, 3.test_cases/, 4.validation_and_observability/, including 1.architectures/5.sagemaker-hyperpod/) at a stable ref, so external consumers can pin to it while main is reorganized.
Why this exists
The SageMaker HyperPod console reads cluster lifecycle scripts directly from numbered paths on main (e.g. 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/). Pinning the console's CloudFormation to this tag keeps it working with zero downtime while the reorganization (PR #1119) removes the numeric prefixes and restructures examples/ into training/ + use-cases/.
Pinning anchor
https://github.com/awslabs/awsome-distributed-ai/tree/v2.0.0-pre-reorg
Raw lifecycle scripts remain available under, e.g.:
https://raw.githubusercontent.com/awslabs/awsome-distributed-ai/v2.0.0-pre-reorg/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/...
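As an illustration of pinning, a consumer could build raw-content URLs against the tag instead of main. The sketch below is not part of the repository: the helper names `pinned_raw_url` and `repin_from_main` are hypothetical, and it assumes the usual `raw.githubusercontent.com/<owner>/<repo>/<ref>/<path>` URL layout shown above. Callers supply the script path themselves; only the base-config directory named in this release is hard-coded.

```python
from urllib.parse import quote

# Values taken from this release note.
RAW_HOST = "https://raw.githubusercontent.com"
REPO = "awslabs/awsome-distributed-ai"
TAG = "v2.0.0-pre-reorg"
BASE_CONFIG_DIR = "1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config"


def pinned_raw_url(relative_path: str, ref: str = TAG) -> str:
    """Return the raw-content URL for a repository path at a fixed ref.

    Hypothetical helper: it only formats a string and does not check
    that the path exists at that ref.
    """
    clean = relative_path.strip("/")
    return f"{RAW_HOST}/{REPO}/{quote(ref, safe='')}/{quote(clean, safe='/')}"


def repin_from_main(url: str, ref: str = TAG) -> str:
    """Rewrite a raw URL that tracks main so that it tracks `ref` instead."""
    prefix = f"{RAW_HOST}/{REPO}/main/"
    if not url.startswith(prefix):
        raise ValueError("URL does not point at main in the expected repository")
    return f"{RAW_HOST}/{REPO}/{ref}/{url[len(prefix):]}"


if __name__ == "__main__":
    # Print the pinned location of the legacy lifecycle-script directory.
    print(pinned_raw_url(BASE_CONFIG_DIR))
```

Resolving every path through one helper means that retiring the tag later only requires changing the `TAG` constant (or passing a different `ref`).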
Lifecycle
Once the HyperPod service is repointed to the new paths on main (post-reorg), this tag can be retired.
Targets main at commit 0edaf1526bc47185ce5f827100416f99bd7b69e1 (HEAD prior to the reorg).
v1.2.0
What's Changed
- Update eksctl cluster versions and ML CBR usage by @bryantbiggs in #573
- Deleting SMP/SMDDP test-cases by @shimomut in #617
- adding picotron by @KeitaW in #584
- Update readme, deprecate test cases, and move Pytorch test cases under pytorch subdirectory by @KeitaW in #620
- Add EKS node autorepair example cluster manifest by @iankouls-aws in #619
- added AmazonEKS_CNI_Policy to SM Exec Role by @bluecrayon52 in #624
- Reduce efa exporter container images by @mhuguesaws in #611
- Change EFA, NCCL version in pipeline by @mhuguesaws in #626
- added DOCKER_NETWORK and env_var persistence for SageMaker Code Editor use at AWS Events by @bluecrayon52 in #623
- updated fsx_ubuntu.sh script with wait loop by @bluecrayon52 in #633
- Change PyTorch version for FSDP case and remove conda by @mhuguesaws in #629
- Change prometheus version for SMHP by @mhuguesaws in #628
- Openzfs smhp by @amanshanbhag in #622
- Fix cloudwatch access from Grafana by @mhuguesaws in #627
- Fixing recently raised Studio Issues by @amanshanbhag in #640
- Terraform Modules for HyperPod EKS by @bluecrayon52 in #586
- Slurm cluster creation issues by @amanshanbhag in #641
- Update 0.distributed-training.Dockerfile by @KeitaW in #645
- Improvements/fsdp restructure by @mhuguesaws in #630
- Add automated Grafana dashboard deployment by @mhuguesaws in #607
- Fix FSDP to use venv first by @mhuguesaws in #650
- nvshmem by @pbelevich in #599
- Update install_enroot_pyxis.sh by @KeitaW in #661
- feat: Add Hyperpod Optimum-neuron LoRA example by @Captainia in #631
- Adding custom dcgm metrics for EKS by @nadknish in #666
- re-adding deepspeed by @KeitaW in #659
- Lcc studio jl by @amanshanbhag in #669
- Update 0.distributed-training.Dockerfile by @nicolaven in #671
- utility to dump details of all nodes in a cluster, into a csv file by @amitosaurus in #652
- Update setup_mariadb_accounting.sh with apg installation by @amanshanbhag in #672
- U 2204 patch -- update from #672 by @amanshanbhag in #673
- Upgrade pinned version of Ansible by @amanshanbhag in #681
- Nghtm patch 2 by @nghtm in #683
- Fix minor spelling mistake in start_slurm.sh by @sammyhori in #686
- Fix nvidia container toolkit to 1.17.6 by @mhuguesaws in #689
- Update 2.SageMakerVPC.yaml by @nghtm in #691
- Skip fsx_ubuntu.sh execution when no FSx parameters are provided in the provisioning parameters by @vaikor-amazon in #692
- Change nccl-tests to have cuda version by @mhuguesaws in #694
- Adding a template for HyperPod EventBridge email notifications by @shimomut in #687
- Improvements/nccl cuda verison bump by @mhuguesaws in #695
- ec2 get metadata replacement by @gmgtamz in #515
- Replacing ********* with localhost in OZFS mount script by @amanshanbhag in #696
- Adding ssh keys to additional (OZFS at /home) file system by @amanshanbhag in #700
- [feat]: Add describe alarm permissions in the execution role for Rolling Update Autorollback. by @divincode in #698
- fsdp k8s yaml to use c10d rdzv backend instead of etcd, updated readm… by @mvinci12 in #701
- Fixing Race Conditions reported in #674 by @amanshanbhag in #703
- feat: Add LoRA fine-tuning optimum-neuron example for slurm by @Captainia in #643
- Fsdp regression tests by @amanshanbhag in #714
- Fix FSDP venv creation by @mhuguesaws in #720
- Updating venv test case for FSDP to point to correct train.py by @amanshanbhag in #725
- Bump requests from 2.32.0 to 2.32.4 in /3.test_cases/pytorch/bionemo by @dependabot[bot] in #727
- new commit for fixing fsdp dataset, using allenai/c4 with HF token by @mvinci12 in #729
- Adding test configs to matrix by @amanshanbhag in #731
- Change FSDP steps and checkpoint steps by @mhuguesaws in #730
- Incorrect indent in container reg test by @amanshanbhag in #732
- Change FSDP steps to reduce time by @mhuguesaws in #734
- Adding SMHP test cluster to matrix (venv) by @amanshanbhag in #740
- Fixing path to match readme instructions by @amanshanbhag in #742
- Feat/picotron resume from checkpoint by @KeitaW in #656
- Fix FSDP venv run by @mhuguesaws in #733
- slurm and eks readme edits by @mvinci12 in #735
- Change FSDP PyTorch to 2.7.1 by @mhuguesaws in #739
- Change FSDP to truncate dataset by @mhuguesaws in #743
- fix typo in NCCL tests README by @KeitaW in #746
- Enable 1click for SageMaker HyperPod by @mhuguesaws in #670
- Fix FSDP requirements.txt to effectively use cuda 128 by @mhuguesaws in #748
- Terraform Modules Updates by @bluecrayon52 in #744
- HyperPod EKS Helper Script Fixes by @bluecrayon52 in #709
- Observability change target scrapping rate to 1 minute by @mhuguesaws in #750
- Fix FSDP destroy process group by @mhuguesaws in #749
- docker library version on eks by @mvinci12 in #753
- Add GPU Health, Slurm exporter to 1click observability by @mhuguesaws in #751
- Add DCGM exporter dashboard with hostnames by @mhuguesaws in #752
- adding llamav3 support on slurm and EKS by @allela-roy in #737
- updating FSDP slurm documentation by @allela-roy in #745
- Updating Parallelcluster deployment guide by @KeitaW in #721
- Update README.md by @nghtm in https://github.com/aws-samples/awsome-distribu...
Release before the mass migration work
This release points to the old directory structure and test cases.
It also adds an opt-in OpenZFS file system that serves as the home directory on SageMaker HyperPod Slurm clusters. This addresses the Lots of Small Files (LoSF) issue that frequently occurs when creating Conda environments in the default home directories, which are backed by Lustre.