Files deleted in this change:

- `3.test_cases/pytorch/cpu-ddp/README.md` (9 lines)
- `3.test_cases/pytorch/cpu-ddp/ddp.py` (124 lines)
- `3.test_cases/pytorch/cpu-ddp/kubernetes/fsdp-simple.yaml` (118 lines)
- `3.test_cases/pytorch/cpu-ddp/slurm/0.create-conda-env.sh` (17 lines)
- `3.test_cases/pytorch/cpu-ddp/slurm/1.conda-train.sbatch` (26 lines)
```
@@ -1,4 +1,7 @@
Miniconda3-latest*
miniconda3
pt_cpu
pt
*.yaml
data
*.pt
mlruns
```
```
@@ -2,6 +2,6 @@ FROM pytorch/pytorch:latest

RUN apt update && apt upgrade -y

RUN pip install mlflow==2.13.2 sagemaker-mlflow==0.1.0
COPY ddp.py /workspace
```
71 changes: 71 additions & 0 deletions 3.test_cases/pytorch/ddp/README.md
# PyTorch DDP <!-- omit in toc -->

Isolated environments are crucial for reproducible machine learning because they encapsulate specific software versions and dependencies, ensuring models are consistently retrainable, shareable, and deployable without compatibility issues.

[Anaconda](https://www.anaconda.com/) uses conda environments to create distinct spaces for projects, allowing different Python versions and libraries to coexist without conflict by isolating each project's dependencies. [Docker](https://www.docker.com/), a containerization platform, packages applications and their dependencies into containers, providing OS-level virtualization so the entire runtime environment runs consistently across Linux servers.

This example demonstrates [PyTorch DDP](https://pytorch.org/tutorials/beginner/ddp_series_theory.html) training with environments managed by either approach. The implementation supports both CPU and GPU computation:

- **CPU Training**: Uses the GLOO backend for distributed training on CPU nodes
- **GPU Training**: Automatically switches to NCCL backend when GPUs are available, providing optimized multi-GPU training
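The backend choice above can be sketched as a small helper (a hypothetical reconstruction; the actual selection logic lives in `ddp.py` and may differ):

```python
def pick_backend(cuda_available: bool) -> str:
    """Choose the torch.distributed backend: NCCL for GPUs, GLOO for CPU."""
    return "nccl" if cuda_available else "gloo"

# In the training script this feeds process-group setup, e.g.:
#   import torch
#   import torch.distributed as dist
#   dist.init_process_group(backend=pick_backend(torch.cuda.is_available()))
```

NCCL requires CUDA devices, while GLOO runs on plain CPU hosts, so gating the choice on `torch.cuda.is_available()` covers both cases.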

## Training

### Basic Usage

To run training on GPUs, launch the script with `torchrun`:
```bash
torchrun --nproc_per_node=N ddp.py --total_epochs=10 --save_every=1 --batch_size=32
```
where N is the number of GPUs you want to use.

## MLflow Integration

This implementation includes [MLflow](https://mlflow.org/) integration for experiment tracking and model management. MLflow tracks metrics, parameters, and artifacts during training, making it easier to compare runs and manage model versions.

### Setup

1. Install MLflow:
```bash
pip install mlflow
```

2. Start the MLflow tracking server:
```bash
mlflow ui
```
```

### Usage

To enable MLflow logging, add the `--use_mlflow` flag when running the training script:
```bash
torchrun --nproc_per_node=N ddp.py --total_epochs=10 --save_every=1 --batch_size=32 --use_mlflow
```

By default, MLflow connects to `http://localhost:5000`. To use a different tracking server, pass `--tracking_uri`:
```bash
torchrun --nproc_per_node=N ddp.py --total_epochs=10 --save_every=1 --batch_size=32 --use_mlflow --tracking_uri=http://localhost:5000
```
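Taken together, the flags in these examples suggest a CLI along the following lines (a sketch built only from the flags shown in this README; the real `ddp.py` may define more options or different defaults):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction of ddp.py's CLI from the flags in this README.
    p = argparse.ArgumentParser(description="PyTorch DDP training")
    p.add_argument("--total_epochs", type=int, required=True,
                   help="total number of epochs to train")
    p.add_argument("--save_every", type=int, required=True,
                   help="checkpoint interval in epochs")
    p.add_argument("--batch_size", type=int, default=32,
                   help="per-process batch size (default assumed)")
    p.add_argument("--use_mlflow", action="store_true",
                   help="enable MLflow experiment tracking")
    p.add_argument("--tracking_uri", default="http://localhost:5000",
                   help="MLflow tracking server URI")
    return p
```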

### What's Tracked

MLflow tracks:
- Training metrics (loss per epoch)
- Model hyperparameters
- Model checkpoints
- Training configuration
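A per-epoch logging step consistent with the list above might look like this (a hypothetical helper; `mlflow.log_params` and `mlflow.log_metric` are the standard MLflow tracking APIs):

```python
def log_epoch(use_mlflow: bool, epoch: int, loss: float, params: dict) -> None:
    """Log one epoch's results to MLflow when tracking is enabled."""
    if not use_mlflow:
        return
    import mlflow  # imported lazily so runs without MLflow installed still work
    if epoch == 0:
        mlflow.log_params(params)  # hyperparameters, logged once per run
    mlflow.log_metric("train_loss", loss, step=epoch)  # loss per epoch
```

Model checkpoints can be attached in the same way with `mlflow.log_artifact(path)` after each save.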

### Viewing Results

Open your browser and navigate to `http://localhost:5000` (or your specified tracking URI).

The MLflow UI provides:
- Experiment comparison
- Metric visualization
- Parameter tracking
- Model artifact management
- Run history

## Deployment

We provide guides for both Slurm and Kubernetes. However, please note that the Conda example is only compatible with Slurm. For detailed instructions, proceed to the [slurm](slurm) or [kubernetes](kubernetes) subdirectory.