This is our implementation of A3C and its synchronous variant A2C, based on the paper "Asynchronous Methods for Deep Reinforcement Learning" by Mnih et al. We also combined it with Generalized Advantage Estimation (GAE), as it has been shown to improve the performance of policy gradient methods.
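For reference, the sketch below illustrates how GAE advantages can be computed from a single rollout. The function name, the tensor-based interface, and the default values for gamma and tau are illustrative assumptions and do not necessarily match the code in this repository.

```python
import torch

def gae_advantages(rewards, values, gamma=0.99, tau=0.95):
    """Illustrative GAE computation for one rollout (not this repository's actual code).

    rewards: list of length T with the observed rewards r_t
    values:  list of length T + 1 with critic estimates V(s_t), including the
             bootstrap value for the state after the last step
    gamma:   discount factor (assumed default)
    tau:     GAE smoothing parameter lambda (assumed default)
    """
    advantages = torch.zeros(len(rewards))
    gae = 0.0
    # Walk backwards through the rollout and accumulate discounted TD residuals:
    # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t),  A_t = delta_t + gamma * tau * A_{t+1}
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * tau * gae
        advantages[t] = gae
    return advantages
```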
- models: Neural network models for actor and critic.
- optimizers: Optimizers with shared statistics for A3C (see the sketch after this list).
- util: Helper methods to make the main code more readable.
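The shared-statistics optimizer is the piece that differs most from a standard PyTorch training setup, so here is a minimal sketch of what an Adam optimizer with shared statistics can look like. The class name, defaults, and the exact state handling are assumptions; the implementation in the optimizers module may differ.

```python
import math
import torch

class SharedAdam(torch.optim.Adam):
    """Adam whose moment estimates live in shared memory, so every A3C worker
    process updates the same optimizer statistics (illustrative sketch only)."""

    def __init__(self, params, lr=1e-4, betas=(0.9, 0.999), eps=1e-8):
        super().__init__(params, lr=lr, betas=betas, eps=eps)
        for group in self.param_groups:
            for p in group['params']:
                state = self.state[p]
                state['step'] = torch.zeros(1)
                state['exp_avg'] = torch.zeros_like(p.data)
                state['exp_avg_sq'] = torch.zeros_like(p.data)
                # Move the statistics into shared memory before workers are forked.
                state['step'].share_memory_()
                state['exp_avg'].share_memory_()
                state['exp_avg_sq'].share_memory_()

    def step(self, closure=None):
        loss = closure() if closure is not None else None
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                state = self.state[p]
                exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
                beta1, beta2 = group['betas']
                state['step'] += 1
                # Biased first/second moment estimates, as in standard Adam.
                exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
                bias_corr1 = 1 - beta1 ** state['step'].item()
                bias_corr2 = 1 - beta2 ** state['step'].item()
                step_size = group['lr'] * math.sqrt(bias_corr2) / bias_corr1
                p.data.addcdiv_(exp_avg, exp_avg_sq.sqrt().add_(group['eps']), value=-step_size)
        return loss
```

In a typical A3C setup, each worker copies its local gradients into the shared model and then calls step() on an optimizer like this; because the moment estimates live in shared memory, updates from all processes accumulate in the same statistics.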
- Activate the Anaconda environment
source activate my_env
- Execute the a3c_runner script (the default environment is CartpoleStabShort-v0)
Start a training run from scratch:
python3 my/path/to/a3c_runner.py
Continue a training run from an existing policy:
python3 my/path/to/a3c_runner.py --path my_model_path
Additional command-line arguments (e.g. for hyperparameter changes) can be passed to the run; for details see
python3 my/path/to/a3c_runner.py --help
- (Optional) Start TensorBoard to monitor the training progress
tensorboard --logdir=./experiments/runs
- Activate the Anaconda environment
source activate my_env
- Execute the a3c_runner script in test mode with the path to a trained model
python3 my/path/to/a3c_runner.py --path my_model_path --test
For example, load the pretrained models in test mode:
python3 a3c_runner.py --env-name CartpoleStabShort-v0 --max-action 5 --test --path experiments/best_models/a3c/stabilization/simulation/model_split_T-53420290_global-7597.67863_test-9999.97380.pth.tar
python3 a3c_runner.py --env-name CartpoleSwingShort-v0 --max-action 10 --test --path experiments/best_models/a3c/swing_up/model_split_T-13881240_global-4532.753498284313_test-19520.67601316739.pth.tar
python3 a3c_runner.py --env-name Qube-v0 --max-action 5 --test --path experiments/best_models/a3c/qube/500Hz/model_split_T-164122000_global-3.66047_test-5.51714.pth.tar
python3 a3c_runner.py --env-name Qube-v0 --max-action 5 --test --path experiments/best_models/a3c/qube/50Hz/model_split_T-72839490_global-2.077353393893449_test-3.4406189782812775.pth.tar