We will follow MosaicML's stable diffusion benchmarking scripts provided here. The scripts use the 'stabilityai/stable-diffusion-2-base'
model. You can check the number of parameters by executing:
python3 calculate_number_of_parameters.py
Model has 1289.952427 M parameters and 865.910724 M trainable_params
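As a rough illustration of where these numbers come from, the following hypothetical snippet counts parameters with the Hugging Face diffusers DiffusionPipeline API, assuming the UNet is the only trainable component; the calculate_number_of_parameters.py script in this repository is the authoritative version.
# hypothetical sketch only; run inside the Python environment set up below
python3 - <<'EOF'
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-base")
modules = (pipe.text_encoder, pipe.vae, pipe.unet)
total = sum(p.numel() for m in modules for p in m.parameters())
trainable = sum(p.numel() for p in pipe.unet.parameters())
print(f"Model has {total / 1e6} M parameters and {trainable / 1e6} M trainable_params")
EOF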
To simplify testing, we have separate scripts for single-node and multi-node distributed training. We will also present a comparison of the throughput (images/second) achieved with P4de (A100 80GB) and P5 (H100 80GB) instances.
You can export the following environment variables:
export PYTHON_VERSION=3.10
# We are using Python version 3.10 in this work. For a different Python version select the right Miniconda file from https://repo.anaconda.com/miniconda/
export MINICONDA_INSTALLER=Miniconda3-py310_23.5.2-0-Linux-x86_64
export CUDA_VERSION=12.1
export MOSAICML_VERSION=0.15.0
export PYTORCH_INDEX_URL=https://download.pytorch.org/whl/nightly/cu121
export PYTORCH_IMAGE=nvcr.io/nvidia/pytorch:23.08-py3
export DOCKER_IMAGE_NAME=mosaicml-stable-diffusion
export TAG=$MOSAICML_VERSION
For the single-node training use case, we provide both a Conda environment and a Dockerfile so you can use whichever you prefer; you do not need both to set up training.
To install Miniconda, please follow the steps below:
# install in the shared directory so compute nodes can source the environment:
cd /apps
# Get the appropriate Miniconda_version from https://repo.anaconda.com/miniconda/
wget -O miniconda.sh "https://repo.anaconda.com/miniconda/${MINICONDA_INSTALLER}.sh" \
&& bash miniconda.sh -b -p ./.conda \
&& ./.conda/bin/conda init bash
# Detect the OS based on /etc/os-release
os=$(grep '^ID=' /etc/os-release | awk -F= '{print $2}' | tr -d '"')
if [[ "$os" == "amzn" ]]; then
source /home/ec2-user/.bashrc
elif [[ "$os" == "ubuntu" ]]; then
source /home/ubuntu/.bashrc
else
echo "Unknown OS: $os"
fi
conda create -n pt-nightlies python=${PYTHON_VERSION}
conda activate pt-nightlies
# Install the PyTorch nightly distribution with the specified CUDA version
pip3 install --pre torch torchvision torchaudio --index-url ${PYTORCH_INDEX_URL}
# Install Diffusers and Transformers
pip3 install diffusers["torch"] transformers
# Install Weights and Biases
pip3 install wandb
# We will install Composer from source. First clone the Repo
git clone https://github.com/mosaicml/composer.git
When testing the latest version of MosaicML's Composer, we found two ways to set up the environment: a PyTorch nightly Conda environment or an NVIDIA PyTorch Docker container. For single-node or multi-node testing, you can use either to run distributed training. Next, we present both approaches.
The Conda environment installs the PyTorch nightly distribution for CUDA 12.1 along with the diffusers, transformers, and Weights & Biases libraries, and clones the Composer repository. Before installing Composer, we need to bypass some torch and torchvision version checks by commenting out those lines in composer/setup.py so that install_requires looks like:
install_requires = [
'pyyaml>=6.0,<7',
'tqdm>=4.62.3,<5',
'torchmetrics>=0.10.0,<1.1',
'torch_optimizer>=0.3.0,<0.4',
'requests>=2.26.0,<3',
'numpy>=1.21.5,<1.27.0',
'psutil>=5.8.0,<6',
'coolname>=1.1.0,<3',
'tabulate==0.9.0', # for auto-generating tables
'py-cpuinfo>=8.0.0,<10',
'packaging>=21.3.0,<23',
'importlib-metadata>=5.0.0,<7',
'mosaicml-cli>=0.5.8,<0.6',
]
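If you prefer to script this edit, a hypothetical sed one-liner such as the one below comments out the pins; verify the result, since the exact pin strings in setup.py may differ between Composer versions.
cd composer
# comment out any install_requires line that pins torch or torchvision
sed -i -E "s/^([[:space:]]*'torch(vision)?[<>=].*)$/# \1/" setup.py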
Once this change is done, you can install Composer from the cloned repository with pip3 install -e .
The single-node folder also has the Dockerfile with commands to build the image and run the container. If you are opting to set up training with a Conda environment, this step is not needed. Run it only if you need to run MosaicML Composer from within an NVIDIA PyTorch container.
cd awsome-distributed-training/3.test_cases/6.stable-diffusion/single-node
# build the image
docker build --build-arg MOSAICML_VERSION=${MOSAICML_VERSION} --build-arg PYTORCH_IMAGE=${PYTORCH_IMAGE} --build-arg PYTORCH_INDEX_URL=${PYTORCH_INDEX_URL} -t ${DOCKER_IMAGE_NAME}:${TAG} -f 0.Dockerfile .
# run container
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -it ${DOCKER_IMAGE_NAME}:${TAG} /bin/bash
Once you are in the Conda environment or the container, run the following to kick off training. In all these tests, we use synthetic data generated by diffusion-benchmark/data.py:
composer benchmark.py --use_ema --use_synth_data --device_train_microbatch_size 4
You should see training start like this:
Namespace(batch_size=2048, image_size=512, remote=None, local='/tmp/mds-cache/mds-laion-2/', use_synth_data=True, model_name='stabilityai/stable-diffusion-2-base', use_ema=True, wandb_name=None, wandb_project=None, device_train_microbatch_size=4)
wandb: Tracking run with wandb version 0.13.11
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
wandb: WARNING URL not available in offline run
******************************
Config:
enabled_algorithms/EMA: true
node_name: unknown because NODENAME environment variable not set
num_gpus_per_node: 8
num_nodes: 1
rank_zero_seed: 3179589898
******************************
train Epoch 0: 38%|█████████▍ | 18/48 [03:28<05:29, 10.99s/ba, loss/train/total=0.1463]
To see GPU utilization, start another shell on the EC2 instance and run watch nvidia-smi, or nvidia-smi -l 2 if watch gives a segmentation error, which happens frequently when the EC2 instance was launched with a DLAMI. You can also run nvidia-smi in Docker as:
docker run --rm -it --gpus all nvidia/cuda:12.2.0-devel-ubuntu20.04 watch nvidia-smi
We ran tests on P4de (A100 80GB) and P5 (H100 80GB) machines and here is a comparison.
Micro Batch Size | num_workers | EMA | P4de Throughput (images/s) | P5 Throughput (images/s) | P5/P4de Ratio |
---|---|---|---|---|---|
4 | 8 | Yes | 142.69 | 189.28 | 1.32x |
8 | 8 | Yes | 173.89 | 289.19 | 1.66x |
12 | 8 | Yes | 185.18 | 329.55 | 1.78x |
16 | 8 | Yes | 192.85 | 353.81 | 1.83x |
20 | 8 | Yes | 197.47 | 358.68 | 1.82x |
20 | 8 | No | 197.90 | 361.15 | 1.83x |
32 | 8 | Yes | 204.22 | 385.31 | 1.89x |
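To reproduce the sweep above, a simple loop over the tested micro-batch sizes is enough; the following is a minimal sketch using only the flags shown earlier (drop --use_ema to reproduce the EMA-disabled row).
# sweep the device_train_microbatch_size values from the table above
for mbs in 4 8 12 16 20 32; do
  composer benchmark.py --use_ema --use_synth_data --device_train_microbatch_size ${mbs}
done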
HuggingFace Diffusers has a `set_attn_processor` method that you can use to plug in different attention processors. A list of attention processors can be found here. You can try a different attention processor like below:
from diffusers import UNet2DConditionModel
from diffusers.models.attention_processor import AttnProcessor

self.unet = UNet2DConditionModel.from_pretrained('stabilityai/stable-diffusion-2-base', subfolder='unet')
self.unet.set_attn_processor(AttnProcessor())
`AttnProcessor2_0`, a processor implementing scaled dot-product attention, is enabled by default if you're using PyTorch 2.0. The older `self.unet.set_attn_processor(AttnProcessor())` gives a CUDA OOM error with a micro-batch size of 32, while `AttnProcessor2_0()` is able to run with a micro-batch size of 32 and yields a throughput of 385 images/sec.
More details on this can be found here: https://pytorch.org/blog/accelerated-diffusers-pt-20/
For multi-node training, we've created a Dockerfile and a Slurm submit script to launch the training job. To get started, please follow the AWS ParallelCluster Distributed Training guide. Before starting this section, make sure you have the following set up:
- AWS ParallelCluster >= 3.7.0
- Pyxis
- Enroot
- FSx Lustre Filesystem
To get started, clone this repo and cd into the multi-node directory:
git clone https://github.com/aws-samples/awsome-distributed-training.git
cd awsome-distributed-training/3.test_cases/6.stable-diffusion/multi-node
docker build --build-arg MOSAICML_VERSION=${MOSAICML_VERSION} --build-arg PYTORCH_IMAGE=${PYTORCH_IMAGE} -t ${DOCKER_IMAGE_NAME}:${TAG} -f 1.Dockerfile .
Convert the Docker container image to an Enroot squash file that will be stored in /apps. This step takes a few minutes.
enroot import -o /apps/${DOCKER_IMAGE_NAME}.sqsh dockerd://${DOCKER_IMAGE_NAME}:${TAG}
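The job is launched with the 2.train.sbatch script in this directory. As a rough, hypothetical sketch of what such a Pyxis-based submit script can look like (the script shipped in the repository is authoritative; the node count and master port here are assumptions):
#!/bin/bash
#SBATCH --nodes=2               # hypothetical node count
#SBATCH --ntasks-per-node=1     # one launcher task per node; composer spawns the 8 GPU workers
#SBATCH --gpus-per-node=8
#SBATCH --exclusive
#SBATCH --job-name=mosaicml-sd

# rendezvous endpoint for the composer launcher (port is an arbitrary choice)
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

# run the training container on every node through Pyxis/Enroot
srun --container-image /apps/${DOCKER_IMAGE_NAME}.sqsh \
  bash -c 'composer --world_size $((SLURM_NNODES * 8)) -n 8 \
    --node_rank ${SLURM_NODEID} --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} \
    benchmark.py --use_ema --use_synth_data --device_train_microbatch_size 4'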
Submit the training job:
sbatch 2.train.sbatch
Once the job starts, the training log should show output like:
Namespace(batch_size=2048, image_size=512, remote=None, local='/tmp/mds-cache/mds-laion-2/', use_synth_data=True, model_name='stabilityai/stable-diffusion-2-base', use_ema=True, wandb_name=None, wandb_project=None, device_train_microbatch_size=4)
wandb: Tracking run with wandb version 0.13.11
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
wandb: WARNING URL not available in offline run
******************************
Config:
enabled_algorithms/EMA: true
node_name: unknown because NODENAME environment variable not set
num_gpus_per_node: 8
num_nodes: 2
rank_zero_seed: 3179589898
******************************
train Epoch 0: 100%|█████████████████████████| 48/48 [09:57<00:00, 12.45s/ba, loss/train/total=0.1521]
Next, we will show how to train Stable Diffusion with MosaicML's Composer on Amazon EKS. To start, we have created an EKS cluster following the steps here. You can follow these steps to add a nodegroup of p5.48xlarge instances. First, export these environment variables:
export AWS_REGION=us-west-2
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
## Docker Image
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
export DOCKER_IMAGE_NAME=mosaicml-stable-diffusion
export MOSAICML_VERSION=0.15.0
export TAG=$MOSAICML_VERSION
export PYTORCH_IMAGE=nvcr.io/nvidia/pytorch:23.08-py3
## Job parameters
export MASTER_PORT=80
export NUM_NODES=64
export NUM_GPUS_PER_NODE=8
export WORLD_SIZE=$((NUM_NODES*NUM_GPUS_PER_NODE))
First we need to run the `aws-do-eks` container, which has all the necessary kubectl tools installed. Just run:
git clone https://github.com/aws-samples/aws-do-eks.git
cd ./aws-do-eks
# Build the aws-do-eks Docker image
./build.sh
# Run container
./run.sh
# Execute in the container
./exec.sh
cd /eks/impl/aws
# Next we will edit the nodegroup.conf config file
To add a managed P5 nodegroup, we will follow the steps listed in the aws-do-eks project. First, we need to create a P5 launch template, and to do that we need to fill out the nodegroup.conf config:
CLUSTER=<Your-EKS-Cluster-Name>
REGION=${AWS_REGION}
LAUNCH_TEMPLATE_NAME=lt-p5-odcr-eks-1-27
LAUNCH_TEMPLATE_ID=
LAUNCH_TEMPLATE_VERSION=1
NODEGROUP_NAME=p5-48xlarge
NODE_ROLE_ARN=<Copy Node IAM Role ARN from `sys` node in EKS console>
SUBNETS=<Private subnet id from the AZ your P5 nodes are in>
MIN_SIZE=0
DESIRED_SIZE=64
MAX_SIZE=64
EFA_VERSION=1.29.1
AMI=<eks-optimized-ami-id>
SSH_KEY_NAME=<ssh-key-name>
CAPACITY_RESERVATION_ID=<capacity reservation id of the P5 nodes>
PLACEMENT_GROUP_NAME=<Create a Placement Group from EC2 console>
You can get the EKS-optimized AMI ID for EKS version 1.27 and Amazon Linux 2 as:
aws ssm get-parameter --name /aws/service/eks/optimized-ami/1.27/amazon-linux-2-gpu/recommended/image_id --region $AWS_REGION --query 'Parameter.Value' --output text
Next you can follow the steps given here to create a P5 nodegroup.
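Those steps drive the creation through the aws-do-eks scripts; as a hypothetical illustration of the underlying call (assuming the launch template from nodegroup.conf has already been created), it is roughly equivalent to:
# hypothetical sketch; the aws-do-eks scripts perform this for you
aws eks create-nodegroup \
    --cluster-name ${CLUSTER} \
    --nodegroup-name ${NODEGROUP_NAME} \
    --node-role ${NODE_ROLE_ARN} \
    --subnets ${SUBNETS} \
    --scaling-config minSize=${MIN_SIZE},maxSize=${MAX_SIZE},desiredSize=${DESIRED_SIZE} \
    --launch-template name=${LAUNCH_TEMPLATE_NAME},version=${LAUNCH_TEMPLATE_VERSION} \
    --region ${REGION}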
Once the nodes are created, you can use `nv` to list the available nodes. `nv` is an alias for `eks-node-viewer`; you can see other aliases by typing `alias`. Below is a sample output from a cluster with 2 `c5.4xlarge` nodes. A status of `Ready` means that the node has joined the cluster. If a node is in a `Not Ready` state, you might need to manually terminate it from the EC2 console; EKS will restart it and the node will join the cluster again.
2 nodes (700m/31780m) 2.2% cpu █░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ $1.360/hour | $992.800/month
13 pods (0 pending 13 running 13 bound)
ip-192-168-70-41.us-west-2.compute.internal cpu █░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 2% (7 pods) c5.4xlarge/$0.6800 On-Demand - Ready
ip-192-168-120-65.us-west-2.compute.internal cpu █░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 2% (6 pods) c5.4xlarge/$0.6800 On-Demand - Ready
←/→ page • q: quit
You can see additional details about the node as below.
kubectl describe node <node-ip-address-from-node-viewer>
The following is the Allocatable section of a P5 node, which shows that 8 GPUs and 32 EFA devices are available:
Allocatable:
cpu: 191450m
ephemeral-storage: 27356033509547
hugepages-1Gi: 0
hugepages-2Mi: 42242Mi
memory: 2052371068Ki
nvidia.com/gpu: 8
pods: 100
vpc.amazonaws.com/efa: 32
Next, we need to build the Docker image and push it to ECR:
docker build --build-arg MOSAICML_VERSION=${MOSAICML_VERSION} --build-arg PYTORCH_IMAGE=${PYTORCH_IMAGE} -t ${REGISTRY}${DOCKER_IMAGE_NAME}:${TAG} -f 1.Dockerfile .
Before pushing the image, you might need to log in to ECR:
aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY
Then push the image:
# Create registry if it does not exist
REGISTRY_COUNT=$(aws ecr describe-repositories | grep ${DOCKER_IMAGE_NAME} | wc -l)
if [ "$REGISTRY_COUNT" == "0" ]; then
echo ""
echo "Creating repository ${DOCKER_IMAGE_NAME} ..."
aws ecr create-repository --repository-name ${DOCKER_IMAGE_NAME}
fi
echo "Pushing image ${REGISTRY}${DOCKER_IMAGE_NAME}:${TAG}"
docker image push ${REGISTRY}${DOCKER_IMAGE_NAME}:${TAG}
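Optionally, you can verify that the image and tag landed in the repository:
# list the tags currently stored in the ECR repository
aws ecr describe-images --repository-name ${DOCKER_IMAGE_NAME} --query 'imageDetails[].imageTags' --output text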
The Kubeflow Training Operator is a Kubernetes-native project for fine-tuning and scalable distributed training that lets you train large models via Kubernetes Custom Resources APIs or the Training Operator Python SDK. You can deploy it with ease from within the aws-do-eks container by executing the following:
cd /eks/deployment/kubeflow/training-operator
./deploy.sh
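Before submitting jobs, you can optionally confirm the operator pod is running (assuming the default kubeflow namespace used by the training-operator manifests):
kubectl get pods -n kubeflow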
We provide a template YAML file for submitting the Stable Diffusion distributed training job in 3.mosaicml-sd-eks.yaml-template. You can substitute the environment variables in the template manifest as:
cat 3.mosaicml-sd-eks.yaml-template | envsubst > mosaicml-sd-eks.yaml
To submit the training job, you need to deploy the etcd
manifest first and then the training manifest.
kubectl apply -f ./etcd.yaml
kubectl apply -f ./mosaicml-sd-eks.yaml
Once the pods are created, you can use the following to monitor the job:
# To view all the pods
kubectl get pods -A
# To view logs from a pod
kubectl logs -f stable-diffusion-worker-0
To kill the training job:
kubectl delete -f mosaicml-sd-eks.yaml
We were able to run a scaling test up to 64 P5 nodes. The following charts show the performance improvement and raw throughput numbers as we scale to 64 nodes:
You can delete the P5 nodegroup either from the EKS console or from the CLI like below:
aws eks delete-nodegroup --cluster-name <Your-EKS-Cluster-Name> --nodegroup-name p5-48xlarge --output text