This repo contains scripts to perform FourCastNeXt training and inference using ERA5 data from NCI project rt52.

For technical details on the model and training methods, please refer to the preprint:
    @article{guo2024fourcastnext,
      title={FourCastNeXt: Improving FourCastNet Training with Limited Compute},
      author={Edison Guo and Maruf Ahmed and Yue Sun and Rahul Mahendru and Rui Yang and Harrison Cook and Tennessee Leeuwenburg and Ben Evans},
      journal={arXiv preprint arXiv:2401.05584},
      year={2024}
    }
- Ask to join NCI project rt52 on Mancini.
- Run `bash setup.sh` to set up the environment. This script sets up a Python virtualenv with all the dependencies. The virtualenv directory `python_env` is in the same directory as `setup.sh`.
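  For example (a minimal sketch; the `python_env/bin/activate` path assumes the standard virtualenv layout):

  ```bash
  # One-off environment setup, run from the repository root.
  bash setup.sh

  # Activate the virtualenv that setup.sh created.
  source python_env/bin/activate
  ```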
- The entry point for training is `run_trainer.pbs`. The inference script is `run_inference.pbs`. Before you run these scripts, open them in a text editor and fill in `<your NCI project>` for `run_trainer.pbs`, and `<output path>` and `<checkpoint path>` for `run_inference.pbs`.
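  Once the placeholders are filled in, both jobs would be submitted with `qsub` from a Gadi login node (standard PBS usage; the repo does not document the submission command explicitly):

  ```bash
  # Submit the training job after filling in <your NCI project>.
  qsub run_trainer.pbs

  # Submit the inference job after filling in <output path> and
  # <checkpoint path>.
  qsub run_inference.pbs
  ```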
`run_trainer.pbs` sets up a training cluster consisting of a GPU cluster for Distributed Data Parallel (DDP) training and a Ray cluster for data loading. The Ray cluster uses the current GPU node as the coordinator and launches three separate CPU Gadi jobs for the data workers. The data workers join the Ray cluster as soon as the CPU Gadi jobs start, and they are shut down automatically when the Ray coordinator shuts down.
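The coordinator/worker pattern behind this is roughly the following (a sketch using Ray's standard CLI; the port and the `HEAD_IP` variable are illustrative, not the values used by the actual scripts):

```bash
# On the GPU node (Ray coordinator): start the Ray head process.
# 6379 is Ray's conventional default port.
ray start --head --port=6379

# In each CPU Gadi job (data worker): join the cluster at the head's
# address. HEAD_IP would be handed to the worker jobs, e.g. via qsub -v.
ray start --address="${HEAD_IP}:6379"

# run_trainer.pbs tears the workers down automatically once the
# coordinator exits, so no manual `ray stop` is needed on the workers.
```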