Repository containing scaffolding for a Python 3-based data science project that uses distributed, multi-GPU training with Horovod together with one of TensorFlow, PyTorch, or MXNet.
Simply follow the instructions to create a new project repository from this template.
Project organization is based on ideas from Good Enough Practices for Scientific Computing.
- Put each project in its own directory, which is named after the project.
- Put external scripts or compiled programs in the `bin` directory.
- Put raw data and metadata in a `data` directory.
- Put text documents associated with the project in the `doc` directory.
- Put all Docker related files in the `docker` directory.
- Install the Conda environment into an `env` directory.
- Put all notebooks in the `notebooks` directory.
- Put files generated during cleanup and analysis in a `results` directory.
- Put project source code in the `src` directory.
- Name all files to reflect their content or function.
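As a concrete illustration, the skeleton described above can be created from an empty project directory with a single command; the `env` directory is omitted here because it is created later by Conda.

```bash
# sketch: create the standard directory layout described above
mkdir -p bin data doc docker notebooks results src
```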
You will need to have the appropriate version of the NVIDIA CUDA Toolkit installed on your workstation. For this repo we are using NVIDIA CUDA Toolkit 11.0 (documentation).
After installing the appropriate version of the NVIDIA CUDA Toolkit you will need to set the following environment variables.
$ export CUDA_HOME=/usr/local/cuda-11.0
$ export PATH=$CUDA_HOME/bin:$PATH
$ export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
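Once these variables are set you can optionally sanity-check the installation; `nvcc` ships with the CUDA Toolkit and `nvidia-smi` ships with the NVIDIA driver.

```bash
# confirm that the CUDA compiler from the toolkit is on your PATH
nvcc --version

# confirm that the NVIDIA driver can see the GPUs in your workstation
nvidia-smi
```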
Ibex users do not need to install the NVIDIA CUDA Toolkit as the relevant versions have already been
made available on Ibex by the Ibex Systems team. Users simply need to load the appropriate version
using the module tool.
$ module load cuda/11.0.1
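If you are unsure which CUDA versions are available on Ibex, the module tool can list them; this is an optional check, not a required step.

```bash
# list the CUDA modules available on the cluster
module avail cuda

# show the modules currently loaded in your session
module list
```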
After adding any necessary dependencies that should be downloaded via conda to the
environment.yml file and any dependencies that should be downloaded via pip to the
requirements.txt file, you create the Conda environment in a sub-directory ./env of your project
directory by running the following commands.
export ENV_PREFIX=$PWD/env
export HOROVOD_CUDA_HOME=$CUDA_HOME
export HOROVOD_NCCL_HOME=$ENV_PREFIX
export HOROVOD_GPU_OPERATIONS=NCCL
export HOROVOD_NCCL_LINK=SHARED
conda env create --prefix $ENV_PREFIX --file environment.yml --force

Once the new environment has been created you can activate the environment with the following command.
conda activate $ENV_PREFIX

Note that the ENV_PREFIX directory is not under version control as it can always be re-created as
necessary.
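If your copy of the template does not already ignore the environment directory, a one-line addition to .gitignore keeps it out of Git; this assumes the environment lives in ./env as above.

```bash
# keep the Conda environment directory out of version control
echo "env/" >> .gitignore
```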
For your convenience these commands have been combined in a shell script ./bin/create-conda-env.sh.
The script should be run from the project root directory as follows.
./bin/create-conda-env.sh # assumes that $CUDA_HOME is set properly
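For reference, a wrapper script along these lines would reproduce the steps above; this is only a sketch and may differ from the create-conda-env.sh that ships with this template.

```bash
#!/bin/bash
# sketch of a possible create-conda-env.sh; assumes $CUDA_HOME has already been set
set -e

export ENV_PREFIX=$PWD/env
export HOROVOD_CUDA_HOME=$CUDA_HOME
export HOROVOD_NCCL_HOME=$ENV_PREFIX
export HOROVOD_GPU_OPERATIONS=NCCL
export HOROVOD_NCCL_LINK=SHARED

conda env create --prefix $ENV_PREFIX --file environment.yml --force
```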
After building the Conda environment you can check that Horovod has been built with support for TensorFlow and MPI with the following command.

conda activate $ENV_PREFIX # optional if environment already active
horovodrun --check-build

You should see output similar to the following.
Horovod v0.21.3:
Available Frameworks:
[X] TensorFlow
[X] PyTorch
[ ] MXNet
Available Controllers:
[X] MPI
[X] Gloo
Available Tensor Operations:
[X] NCCL
[ ] DDL
[ ] CCL
[X] MPI
[X] Gloo
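Once the build check looks good, distributed training runs are typically launched with horovodrun. The example below is a sketch that assumes a hypothetical training script at src/train.py and a single node with 4 GPUs; adjust -np and the host list to match your hardware.

```bash
# launch 4 Horovod worker processes on the local machine, one per GPU
# (src/train.py is a placeholder for your own training script)
horovodrun -np 4 -H localhost:4 python src/train.py
```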
The explicit dependencies for the project are listed in the environment.yml file. To see
the full list of packages installed into the environment run the following command.
conda list --prefix $ENV_PREFIX

If you add (remove) dependencies to (from) the environment.yml file or the requirements.txt file
after the environment has already been created, then you can re-create the environment with the
following command.
$ conda env create --prefix $ENV_PREFIX --file environment.yml --force
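If Horovod is installed via pip from requirements.txt (as in this template), it is safest to set the same HOROVOD_* build variables before re-creating the environment, or simply re-run ./bin/create-conda-env.sh; a minimal sketch:

```bash
# re-create the environment with the same Horovod build settings used initially
export ENV_PREFIX=$PWD/env
export HOROVOD_CUDA_HOME=$CUDA_HOME
export HOROVOD_NCCL_HOME=$ENV_PREFIX
export HOROVOD_GPU_OPERATIONS=NCCL
export HOROVOD_NCCL_LINK=SHARED
conda env create --prefix $ENV_PREFIX --file environment.yml --force
```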
In order to build Docker images for your project and run containers with GPU acceleration you will need to install Docker, Docker Compose and the NVIDIA Docker runtime.

Detailed instructions for using Docker to build an image and launch containers can be found in
the docker/README.md.