Files deleted in this change:

- `3.test_cases/pytorch/cpu-ddp/README.md` (9 lines)
- `3.test_cases/pytorch/cpu-ddp/ddp.py` (124 lines)
- `3.test_cases/pytorch/cpu-ddp/kubernetes/fsdp-simple.yaml` (118 lines)
- `3.test_cases/pytorch/cpu-ddp/slurm/0.create-conda-env.sh` (17 lines)
- `3.test_cases/pytorch/cpu-ddp/slurm/1.conda-train.sbatch` (26 lines)
```
@@ -1,4 +1,7 @@
Miniconda3-latest*
miniconda3
pt_cpu
pt
*.yaml
data
*.pt
mlruns
```
```
@@ -2,6 +2,6 @@ FROM pytorch/pytorch:latest

RUN apt update && apt upgrade -y

RUN pip install mlflow==2.13.2 sagemaker-mlflow==0.1.0
COPY ddp.py /workspace
```
71 changes: 71 additions & 0 deletions 3.test_cases/pytorch/ddp/README.md
# PyTorch DDP <!-- omit in toc -->

Isolated environments are crucial for reproducible machine learning because they encapsulate specific software versions and dependencies, ensuring models are consistently retrainable, shareable, and deployable without compatibility issues.

[Anaconda](https://www.anaconda.com/) uses conda environments to create distinct spaces for projects, allowing different Python versions and libraries to coexist without conflict by isolating each project's dependencies. [Docker](https://www.docker.com/), a containerization platform, packages applications and their dependencies into containers, providing OS-level virtualization so the entire runtime environment runs consistently across Linux servers.

This example demonstrates [PyTorch DDP](https://pytorch.org/tutorials/beginner/ddp_series_theory.html) training with environments managed by either approach. The implementation supports both CPU and GPU computation:

- **CPU Training**: Uses the GLOO backend for distributed training on CPU nodes
- **GPU Training**: Automatically switches to NCCL backend when GPUs are available, providing optimized multi-GPU training
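The backend choice above can be sketched as a small helper (a hypothetical reconstruction; the actual selection logic lives in `ddp.py` and may differ):

```python
def pick_backend(cuda_available: bool) -> str:
    """Choose the torch.distributed backend: NCCL for GPUs, GLOO for CPU."""
    return "nccl" if cuda_available else "gloo"

# In the training script this feeds process-group setup, e.g.:
#   import torch
#   import torch.distributed as dist
#   dist.init_process_group(backend=pick_backend(torch.cuda.is_available()))
```

NCCL requires CUDA devices, while GLOO runs on plain CPU hosts, so gating the choice on `torch.cuda.is_available()` covers both cases.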

## Training

### Basic Usage

To run training on GPUs, launch the script with `torchrun`:
```bash
torchrun --nproc_per_node=N ddp.py --total_epochs=10 --save_every=1 --batch_size=32
```
where N is the number of GPUs you want to use.

## MLflow Integration

This implementation includes [MLflow](https://mlflow.org/) integration for experiment tracking and model management. MLflow tracks metrics, parameters, and artifacts during training, making it easier to compare runs and manage model versions.

### Setup

1. Install MLflow:
```bash
pip install mlflow
```

2. Start the MLflow tracking server:
```bash
mlflow ui
```
```

### Usage

To enable MLflow logging, add the `--use_mlflow` flag when running the training script:
```bash
torchrun --nproc_per_node=N ddp.py --total_epochs=10 --save_every=1 --batch_size=32 --use_mlflow
```

By default, MLflow connects to `http://localhost:5000`. To use a different tracking server, pass `--tracking_uri`:
```bash
torchrun --nproc_per_node=N ddp.py --total_epochs=10 --save_every=1 --batch_size=32 --use_mlflow --tracking_uri=http://localhost:5000
```
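Taken together, the flags in these examples suggest a CLI along the following lines (a sketch built only from the flags shown in this README; the real `ddp.py` may define more options or different defaults):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction of ddp.py's CLI from the flags in this README.
    p = argparse.ArgumentParser(description="PyTorch DDP training")
    p.add_argument("--total_epochs", type=int, required=True,
                   help="total number of epochs to train")
    p.add_argument("--save_every", type=int, required=True,
                   help="checkpoint interval in epochs")
    p.add_argument("--batch_size", type=int, default=32,
                   help="per-process batch size (default assumed)")
    p.add_argument("--use_mlflow", action="store_true",
                   help="enable MLflow experiment tracking")
    p.add_argument("--tracking_uri", default="http://localhost:5000",
                   help="MLflow tracking server URI")
    return p
```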

### What's Tracked

MLflow tracks:
- Training metrics (loss per epoch)
- Model hyperparameters
- Model checkpoints
- Training configuration
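A per-epoch logging step consistent with the list above might look like this (a hypothetical helper; `mlflow.log_params` and `mlflow.log_metric` are the standard MLflow tracking APIs):

```python
def log_epoch(use_mlflow: bool, epoch: int, loss: float, params: dict) -> None:
    """Log one epoch's results to MLflow when tracking is enabled."""
    if not use_mlflow:
        return
    import mlflow  # imported lazily so runs without MLflow installed still work
    if epoch == 0:
        mlflow.log_params(params)  # hyperparameters, logged once per run
    mlflow.log_metric("train_loss", loss, step=epoch)  # loss per epoch
```

Model checkpoints can be attached in the same way with `mlflow.log_artifact(path)` after each save.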

### Viewing Results

Open your browser and navigate to `http://localhost:5000` (or your specified tracking URI).

The MLflow UI provides:
- Experiment comparison
- Metric visualization
- Parameter tracking
- Model artifact management
- Run history

## Deployment

We provide guides for both Slurm and Kubernetes. However, please note that the Conda example is only compatible with Slurm. For detailed instructions, proceed to the [slurm](slurm) or [kubernetes](kubernetes) subdirectory.