This directory provides a script and recipe to train the UNet2D and UNet3D models to state-of-the-art accuracy. It also contains scripts to run inference on the UNet2D and UNet3D models on the Intel® Gaudi® AI accelerator. These scripts are tested and maintained by Intel Gaudi. For further information on performance, refer to the Intel Gaudi Model Performance Data page. Before you get started, make sure to review the Supported Configurations.
For further information on training deep learning models using Gaudi, refer to developer.habana.ai.
- Model-References
- Model Overview
- Setup
- Media Loading Acceleration
- Training Examples
- Pre-trained Checkpoint
- Inference Examples
- Accuracy Evaluation
- Advanced
- Supported Configurations
- Changelog
- Known Issues
The supported UNet2D and UNet3D are based on PyTorch Lightning. The PyTorch Lightning implementations are based on an earlier implementation from NVIDIA's nnUNet. Gaudi support is enabled with PyTorch Lightning version 1.7.7, which is installed along with the release dockers. For further details on the changes applied to the original model, refer to Training Script Modifications.
The following are the demos included in this release:
- For UNet2D, Lazy mode training for batch size 64 with FP32 and BF16 mixed precision.
- For UNet3D, Lazy mode training for batch size 2 with FP32 and BF16 mixed precision.
- For UNet2D, inference for batch size 64 with FP32 and BF16 mixed precision.
- For UNet3D, inference for batch size 2 with FP32 and BF16 mixed precision.
Please follow the instructions provided in the Gaudi Installation Guide to set up the environment, including the $PYTHON environment variable. To achieve the best performance, please follow the methods outlined in the Optimizing Training Platform Guide. The guides will walk you through the process of setting up your system to run the model on Gaudi.
In the docker container, clone this repository and switch to the branch that matches your Intel Gaudi software version. You can run the hl-smi utility to determine the Intel Gaudi software version.
git clone -b [Intel Gaudi software version] https://github.com/HabanaAI/Model-References
NOTE: If the repository is not in the PYTHONPATH, make sure you update it:
export PYTHONPATH=/path/to/Model-References:$PYTHONPATH
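For example, if hl-smi reports Intel Gaudi software version 1.18.0 (the version listed under Supported Configurations below), the commands would look as follows, assuming the release branch is named after the version:

git clone -b 1.18.0 https://github.com/HabanaAI/Model-References
export PYTHONPATH=$PWD/Model-References:$PYTHONPATH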
- Go to PyTorch UNet directory:
cd Model-References/PyTorch/computer_vision/segmentation/Unet
- Install the required packages.
On Ubuntu20.04:
pip install -r ./requirements.txt
On Ubuntu22.04:
pip install -r ./requirements_u22.txt
- Create a /data directory if not present:
mkdir /data
- Download the dataset:
$PYTHON download.py --task 01
NOTE: The script downloads the dataset to the /data directory by default.
- To pre-process the dataset for UNet2D, run:
$PYTHON preprocess.py --task 01 --dim 2 --results /data/pytorch/unet/
$PYTHON preprocess.py --task 01 --dim 2 --exec_mode val --results /data/pytorch/unet/
$PYTHON preprocess.py --task 01 --dim 2 --exec_mode test --results /data/pytorch/unet/
- To pre-process the dataset for UNet3D, run:
$PYTHON preprocess.py --task 01 --dim 3 --results /data/pytorch/unet/
$PYTHON preprocess.py --task 01 --dim 3 --exec_mode val --results /data/pytorch/unet/
$PYTHON preprocess.py --task 01 --dim 3 --exec_mode test --results /data/pytorch/unet/
NOTE: The script pre-processes the dataset downloaded in the above steps from the /data directory. Under the given --results directory it creates a 01_2d directory for the UNet2D model and a 01_3d directory for the UNet3D model. Consequently, the pre-processed dataset is available at /data/pytorch/unet/01_2d for UNet2D and at /data/pytorch/unet/01_3d for UNet3D.
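A quick sanity check after pre-processing is to confirm that both output directories were created, for example:

ls /data/pytorch/unet/01_2d /data/pytorch/unet/01_3d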
Gaudi 2 offers a dedicated hardware engine for Media Loading operations. For more details, please refer to Intel Gaudi Media Loader.
NOTE: The training examples are applicable to first-gen Gaudi and Gaudi 2 in torch.compile mode. When using Eager mode, replace --use_torch_compile with --run-lazy-mode=False in the examples below.
export PT_HPU_LAZY_MODE=0
mkdir -p /tmp/Unet/results/fold_0
Run training on 1 HPU:
NOTE: The following commands use PyTorch Lightning by default. To use the media loader on Gaudi 2, add --habana_loader to the run commands.
- UNet2D in torch.compile mode, BF16 mixed precision, batch size 64, fold 0:
$PYTHON -u main.py --results /tmp/Unet/results/fold_0 --task 01 \
--logname res_log --fold 0 --hpus 1 --gpus 0 --data /data/pytorch/unet/01_2d \
--seed 1 --num_workers 8 --affinity disabled --norm instance --dim 2 \
--optimizer fusedadamw --exec_mode train --learning_rate 0.001 --autocast \
--deep_supervision --batch_size 64 --val_batch_size 64 --use_torch_compile
- UNet2D in torch.compile mode, BF16 mixed precision, batch size 64, benchmarking:
$PYTHON -u main.py --results /tmp/Unet/results/fold_0 --task 1 --logname res_log \
--fold 0 --hpus 1 --gpus 0 --data /data/pytorch/unet/01_2d --seed 123 \
--num_workers 1 --affinity disabled --norm instance --dim 2 --optimizer fusedadamw \
--exec_mode train --learning_rate 0.001 --autocast --batch_size 64 \
--val_batch_size 64 --benchmark --min_epochs 1 --max_epochs 2 --train_batches 150 --test_batches 150 --use_torch_compile
- UNet3D in torch.compile mode, BF16 mixed precision, batch size 2, fold 0:
$PYTHON -u main.py --results /tmp/Unet/results/fold_0 --task 01 --logname res_log \
--fold 0 --hpus 1 --gpus 0 --data /data/pytorch/unet/01_3d --seed 1 --num_workers 8 \
--affinity disabled --norm instance --dim 3 --optimizer fusedadamw \
--exec_mode train --learning_rate 0.001 --autocast --deep_supervision --batch_size 2 --val_batch_size 2 --use_torch_compile
- UNet3D in torch.compile mode, BF16 mixed precision, batch size 2, benchmarking:
$PYTHON -u main.py --results /tmp/Unet/results/fold_0 --task 1 --logname res_log \
--fold 0 --hpus 1 --gpus 0 --data /data/pytorch/unet/01_3d --seed 1 --num_workers 1 \
--affinity disabled --norm instance --dim 3 --optimizer fusedadamw \
--exec_mode train --learning_rate 0.001 --autocast --batch_size 2 \
--val_batch_size 2 --benchmark --min_epochs 1 --max_epochs 2 --train_batches 150 --test_batches 150 --use_torch_compile
Run training on 8 HPUs:
NOTE: The following commands use PyTorch Lightning by default. To use the media loader on Gaudi 2, add --habana_loader to the run commands.
To run the multi-card demo, make sure the following prerequisites are met prior to training:
- The host machine has 512 GB of RAM installed.
- Docker is installed and set up as per the Gaudi Setup and Installation Guide, so that the container has access to all 8 cards required for the multi-card demo. Multi-card configuration for UNet2D and UNet3D training has been verified with up to 1 server, i.e. 8 Gaudi/Gaudi 2 cards.
- All server network interfaces are up. You can change the state of each network interface managed by the habanalabs driver by running the following command (a loop that applies this to all Gaudi interfaces is sketched after the note below):
sudo ip link set <interface_name> up
NOTE: To identify whether a specific network interface is managed by the habanalabs driver, run:
sudo ethtool -i <interface_name>
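The following sketch brings up every interface managed by the habanalabs driver in one pass. Interface names vary per system, so treat this as illustrative rather than authoritative:

for ifc in /sys/class/net/*; do
  ifc=$(basename "$ifc")
  # Bring the interface up only if ethtool reports the habanalabs driver
  if [ "$(sudo ethtool -i "$ifc" 2>/dev/null | awk '/^driver:/ {print $2}')" = "habanalabs" ]; then
    sudo ip link set "$ifc" up
  fi
done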
- UNet2D in torch.compile mode, BF16 mixed precision, batch size 64, world-size 8, fold 0:
$PYTHON -u main.py --results /tmp/Unet/results/fold_0 --task 1 --logname res_log \
--fold 0 --hpus 8 --gpus 0 --data /data/pytorch/unet/01_2d --seed 123 --num_workers 8 \
--affinity disabled --norm instance --dim 2 --optimizer fusedadamw --exec_mode train \
--learning_rate 0.001 --autocast --deep_supervision --batch_size 64 \
--val_batch_size 64 --min_epochs 30 --max_epochs 10000 --train_batches 0 --test_batches 0 --use_torch_compile
- UNet2D in torch.compile mode, BF16 mixed precision, batch size 64, world-size 8, benchmarking:
$PYTHON -u main.py --results /tmp/Unet/results/fold_0 --task 1 --logname res_log \
--fold 0 --hpus 8 --gpus 0 --data /data/pytorch/unet/01_2d --seed 123 --num_workers 1 \
--affinity disabled --norm instance --dim 2 --optimizer fusedadamw --exec_mode train \
--learning_rate 0.001 --autocast --batch_size 64 \
--val_batch_size 64 --benchmark --min_epochs 1 --max_epochs 2 --train_batches 150 --test_batches 150 --use_torch_compile
- UNet3D in torch.compile mode, BF16 mixed precision, batch size 2, world-size 8, fold 0:
$PYTHON -u main.py --results /tmp/Unet/results/fold_0 --task 01 --logname res_log \
--fold 0 --hpus 8 --gpus 0 --data /data/pytorch/unet/01_3d --seed 1 --num_workers 8 \
--affinity disabled --norm instance --dim 3 --optimizer fusedadamw --exec_mode train \
--learning_rate 0.001 --autocast --deep_supervision --batch_size 2 --val_batch_size 2 --use_torch_compile
- UNet3D in torch.compile mode, BF16 mixed precision, batch size 2, world-size 8, benchmarking:
$PYTHON -u main.py --results /tmp/Unet/results/fold_0 --task 1 --logname res_log --fold 0 \
--hpus 8 --gpus 0 --data /data/pytorch/unet/01_3d --seed 123 --num_workers 1 \
--affinity disabled --norm instance --dim 3 --optimizer fusedadamw --exec_mode train \
--learning_rate 0.001 --autocast --batch_size 2 \
--val_batch_size 2 --benchmark --min_epochs 1 --max_epochs 2 --train_batches 150 --test_batches 150 --use_torch_compile
To run the inference example, a pre-trained checkpoint is required. Intel Gaudi provides UNet2D and UNet3D checkpoints pre-trained on Gaudi. For example, the relevant checkpoint for UNet2D can be downloaded from UNet2D Catalog. The relevant checkpoint for UNet3D can be downloaded from UNet3D Catalog.
cd Model-References/PyTorch/computer_vision/segmentation/Unet
mkdir pretrained_checkpoint
wget </url/of/pretrained_checkpoint.tar.gz>
tar -xvf <pretrained_checkpoint.tar.gz> -C pretrained_checkpoint && rm <pretrained_checkpoint.tar.gz>
The following commands assume that:
- The pre-processed dataset is available in the /data/pytorch/unet/ directory. An alternative dataset location can be specified using the --data argument.
- The pre-trained checkpoint is available at pretrained_checkpoint/pretrained_checkpoint.pt. An alternative file name for the pre-trained checkpoint can be specified using the --ckpt_path argument. (An example that overrides both arguments follows this list.)
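For example, a UNet2D inference command that overrides both arguments might look as follows; the dataset path and checkpoint file name here are illustrative placeholders:

$PYTHON main.py --exec_mode predict --task 01 --hpus 1 --fold 3 --seed 123 \
  --val_batch_size 64 --dim 2 --data=/mnt/datasets/unet/01_2d \
  --results=/tmp/Unet/results/fold_3 --autocast \
  --ckpt_path pretrained_checkpoint/my_checkpoint.pt --use_torch_compile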
NOTE: The following commands use PyTorch Lightning by default. To use the media loader on Gaudi 2, add --habana_loader to the run commands. The default --measurement_type is throughput, which reports performance; to measure actual latency instead, add --measurement_type latency to the run commands below (see the example at the end of the Benchmark Inference commands). When using Eager mode, replace --use_torch_compile with --run-lazy-mode=False in the examples below.
export PT_HPU_LAZY_MODE=0
mkdir -p /tmp/Unet/results/fold_3
Run inference on 1 HPU:
Benchmark Inference
- UNet2D in torch.compile mode, BF16 mixed precision, batch size 64:
$PYTHON main.py --exec_mode predict --task 01 --hpus 1 --fold 3 --val_batch_size 64 --dim 2 --data=/data/pytorch/unet/01_2d --results=/tmp/Unet/results/fold_3 --autocast --benchmark --test_batches 150 --use_torch_compile
- UNet2D in torch.compile mode, FP32 precision, batch size 64:
$PYTHON main.py --exec_mode predict --task 01 --hpus 1 --fold 3 --val_batch_size 64 --dim 2 --data=/data/pytorch/unet/01_2d --results=/tmp/Unet/results/fold_3 --benchmark --test_batches 150 --use_torch_compile
- UNet3D in torch.compile mode, BF16 mixed precision, batch size 2:
$PYTHON main.py --exec_mode predict --task 01 --hpus 1 --fold 3 --val_batch_size 2 --dim 3 --data=/data/pytorch/unet/01_3d --results=/tmp/Unet/results/fold_3 --autocast --benchmark --test_batches 150 --use_torch_compile
- UNet3D in torch.compile mode, FP32 precision, batch size 2:
$PYTHON main.py --exec_mode predict --task 01 --hpus 1 --fold 3 --val_batch_size 2 --dim 3 --data=/data/pytorch/unet/01_3d --results=/tmp/Unet/results/fold_3 --benchmark --test_batches 150 --use_torch_compile
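For example, to measure latency instead of throughput for the UNet2D BF16 benchmark above, append --measurement_type latency to the command:

$PYTHON main.py --exec_mode predict --task 01 --hpus 1 --fold 3 --val_batch_size 64 --dim 2 --data=/data/pytorch/unet/01_2d --results=/tmp/Unet/results/fold_3 --autocast --benchmark --test_batches 150 --use_torch_compile --measurement_type latency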
Inference
- UNet2D in torch.compile mode, BF16 mixed precision, batch size 64:
$PYTHON main.py --exec_mode predict --task 01 --hpus 1 --fold 3 --seed 123 --val_batch_size 64 --dim 2 --data=/data/pytorch/unet/01_2d --results=/tmp/Unet/results/fold_3 --autocast --ckpt_path pretrained_checkpoint/pretrained_checkpoint.pt --use_torch_compile
- UNet2D in torch.compile mode, FP32 precision, batch size 64:
$PYTHON main.py --exec_mode predict --task 01 --hpus 1 --fold 3 --seed 123 --val_batch_size 64 --dim 2 --data=/data/pytorch/unet/01_2d --results=/tmp/Unet/results/fold_3 --ckpt_path pretrained_checkpoint/pretrained_checkpoint.pt --use_torch_compile
- UNet3D in torch.compile mode, BF16 mixed precision, batch size 2:
$PYTHON main.py --exec_mode predict --task 01 --hpus 1 --fold 3 --seed 123 --val_batch_size 2 --dim 3 --data=/data/pytorch/unet/01_3d --results=/tmp/Unet/results/fold_3 --autocast --ckpt_path pretrained_checkpoint/pretrained_checkpoint.pt --use_torch_compile
- UNet3D in torch.compile mode, FP32 precision, batch size 2:
$PYTHON main.py --exec_mode predict --task 01 --hpus 1 --fold 3 --seed 123 --val_batch_size 2 --dim 3 --data=/data/pytorch/unet/01_3d --results=/tmp/Unet/results/fold_3 --ckpt_path pretrained_checkpoint/pretrained_checkpoint.pt --use_torch_compile
export PT_HPU_LAZY_MODE=0
mkdir -p /tmp/Unet/results/fold_3
NOTE: The following commands use PyTorch Lightning by default. To use the media loader on Gaudi 2, add --habana_loader to the run commands. When using Eager mode, replace --use_torch_compile with --run-lazy-mode=False in the examples below.
- UNet2D in torch.compile mode, FP32 precision, batch size 64:
$PYTHON main.py --exec_mode=evaluate --data=/data/pytorch/unet/01_2d --hpus=1 --fold=3 --seed 123 --batch_size=64 --val_batch_size=64 --task=01 --dim=2 --results=/tmp/Unet/results/fold_3 --ckpt_path pretrained_checkpoint/pretrained_checkpoint.pt --use_torch_compile
- UNet2D in torch.compile mode, BF16 mixed precision, batch size 64:
$PYTHON main.py --exec_mode=evaluate --data=/data/pytorch/unet/01_2d --hpus=1 --fold=3 --seed 123 --batch_size=64 --val_batch_size=64 --autocast --task=01 --dim=2 --results=/tmp/Unet/results/fold_3 --ckpt_path pretrained_checkpoint/pretrained_checkpoint.pt --use_torch_compile
- UNet3D in torch.compile mode, FP32 precision, batch size 2:
$PYTHON main.py --exec_mode=evaluate --data=/data/pytorch/unet/01_3d/ --hpus=1 --fold=3 --seed 123 --batch_size=2 --val_batch_size=2 --task=01 --dim=3 --results=/tmp/Unet/results/fold_3 --ckpt_path pretrained_checkpoint/pretrained_checkpoint.pt --use_torch_compile
- UNet3D in torch.compile mode, BF16 mixed precision, batch size 2:
$PYTHON main.py --exec_mode=evaluate --data=/data/pytorch/unet/01_3d/ --hpus=1 --fold=3 --seed 123 --batch_size=2 --val_batch_size=2 --autocast --task=01 --dim=3 --results=/tmp/Unet/results/fold_3 --ckpt_path pretrained_checkpoint/pretrained_checkpoint.pt --use_torch_compile
- The above inference commands can be used with --save_preds, in which case the predictions are saved to a folder.
- Use the saved predictions and the target labels folder, as shown in the command below, to compute accuracy.
$PYTHON evaluate.py --preds <prediction_results_path> --lbls <labels_path>
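For example, a sketch of the full flow for UNet2D; where exactly --save_preds writes the predictions and where the ground-truth labels live depend on your setup, so the placeholder paths in the second command must be adjusted:

$PYTHON main.py --exec_mode predict --task 01 --hpus 1 --fold 3 --seed 123 \
  --val_batch_size 64 --dim 2 --data=/data/pytorch/unet/01_2d \
  --results=/tmp/Unet/results/fold_3 --autocast \
  --ckpt_path pretrained_checkpoint/pretrained_checkpoint.pt \
  --use_torch_compile --save_preds
$PYTHON evaluate.py --preds <prediction_results_path> --lbls <labels_path>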
To see the available training parameters, run the following command:
$PYTHON -u main.py --help
UNet2D and UNet3D 1x card
| Validated on | Intel Gaudi Software Version | PyTorch Lightning Version | Mode |
|--------------|------------------------------|---------------------------|------|
| Gaudi        | 1.18.0                       | 2.3.3                     | Training |
| Gaudi 2      | 1.18.0                       | 2.3.3                     | Training |
| Gaudi        | 1.18.0                       | 2.3.3                     | Inference |
| Gaudi 2      | 1.18.0                       | 2.3.3                     | Inference |
UNet2D and UNet3D 8x cards
| Validated on | Intel Gaudi Software Version | PyTorch Lightning Version | Mode |
|--------------|------------------------------|---------------------------|------|
| Gaudi        | 1.18.0                       | 2.3.3                     | Training |
| Gaudi 2      | 1.18.0                       | 2.3.3                     | Training |
- Added support for torch.compile and Eager mode inference.
- Added support for torch.compile and Eager mode training.
- Upgraded the DALI dataloader package "nvidia-dali-cuda110" to version 1.32.0.
- Added support for Gaudi on Ubuntu22.04.
- Enabled using HPU Graphs by default.
- Added an option to enable HPU Graphs in training via the --hpu_graphs flag.
- Removed HMP, switched to autocast.
- Eager mode support is deprecated.
- Dynamic shapes will be enabled by default in future releases. They are currently enabled in the training script as a temporary solution.
- UNet2D/3D training using native PyTorch scripts (without PyTorch Lightning) is deprecated.
- Enabled dynamic shapes.
- Enabled HPUProfiler using habana-lightning-plugins.
- Disabled dynamic shapes.
- Upgraded pytorch-lightning to 1.9.4 version.
- Enabled usage of PyTorch autocast.
- Initial release for inference support on UNet3D.
- Removed support for Gaudi on Ubuntu22.04.
- Refactored the code to run on Ubuntu22.04 without the DALI dataloader on Gaudi 2.
- Installation instructions are different for Ubuntu20.04 and Ubuntu22.04.
- HPU Graphs is the default inference mode.
- Removed newly added scripts to support inference.
- Inference is supported through existing scripts only.
- Initial release for inference support on UNet2D.
- Updated the script to use the TQDM progress bar in order to override the progress bar refresh rate.
- Upgraded Unet to work with pytorch-lightning 1.7.7.
- Removed mark_step handling in the script as it is handled by the PyTorch Lightning plugins.
- Added the optimizer_zero_grad hook and changed progress_bar_refresh_rate to improve performance.
- Added support for 1- and 8-card training on Gaudi 2.
- Added PyTorch support (without PyTorch Lightning) for a single Gaudi device with a new flag (--framework pytorch) in the run command.
- Changed the scripts to use vanilla PyTorch Lightning 1.6.4, which includes HPU device support.
- Removed support for channels last format.
- Weights and other dependent parameters need not be permuted anymore.
- Default execution mode modified to Lazy mode.
- All ops in validation are executed on HPU.
- Changes to improve time-to-train for UNet3D.
- Removed support for specifying frequency of validation.
- Bucket size has been increased to 125MB.
- Enabled HCCL flow for distributed training.
The following are the changes made to the training scripts:
- Added support for Gaudi devices:
  - Loading the Intel Gaudi specific library.
  - Certain environment variables are defined for Gaudi.
  - Added support to run training in Lazy mode in addition to Eager mode; mark_step() is performed to trigger execution.
  - Changes to enable the scripts on PyTorch Lightning 1.4.0, as the base scripts used an older version of PyTorch Lightning.
  - Added support to use the HPU accelerator plugin, the DDP plugin (for multi-card training), and the mixed precision plugin provided with the installed PyTorch Lightning package.
- Improved performance:
  - The optimized FusedAdamW operator is used in place of torch.optim.AdamW.
  - Added dice.py with code from the monai package and replaced the slice with the split operator in the forward method.
  - Added monai_sliding_window_inference.py with code from the monai package, modified to avoid recomputing the importance map every iteration.
  - Changes to configure the gradient reduction bucket size, set gradients as a bucket view for all-reduce, and use static graphs for multi-HPU training.
  - Changed progress_bar_refresh_rate while instantiating the Trainer as a workaround for Lightning-AI/pytorch-lightning#13179.
- Changes to run the DALI dataloader on the CPU and make data loading deterministic.
- The Metric implementation was copied to pl_metric.py from an older version of PyTorch Lightning (1.0.4). The implementation in PyTorch Lightning 1.4.0 (torch.metric) is different and incompatible.
- PyTorch Lightning metrics have been deprecated since PyTorch Lightning 1.3, with the suggestion to switch to torchmetrics. Since the stat_scores implementation is different and incompatible, the older version was copied here from PyTorch Lightning 1.0.
- As a workaround for NVIDIA/DALI#3865, validation loss is not computed in odd epochs. Other validation metrics are computed every epoch. All metrics are logged only for even epochs.
- Added HPU Graphs support to reduce latency for inference.
- Placing mark_step() arbitrarily may lead to undefined behavior. It is recommended to keep mark_step() as shown in the provided scripts.
- Only the scripts and configurations mentioned in this README are supported and verified.