This repository contains the scripts to perform Nucleotide Augmentation (NTA) on a labeled data set of amino acid sequences as well as to recreate the analysis described in Minot & Reddy 2022 [1].
Before running any of the scripts, install the necessary packages using one of the two provided environment files. With Conda, the open source package management system [2], use nta_environment.yml via the following commands:
cd nta_environment
conda env create -f nta_environment.yml
conda activate nta_env
If virtualenv is preferred, the environment can be set up from requirements.txt in nta_environment/ via the following three commands, using Python 3.8.5 and pip 20.1.1:
python -m venv nta_env
- Next, on Windows, run:
nta_env\Scripts\activate.bat
Or, on Unix / macOS, run:
source nta_env/bin/activate
- Then pip install the requirements via:
pip install -r requirements.txt
The folder run_nta contains instructions and code to apply NTA to your own data.
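For orientation, the idea behind NTA can be sketched in a few lines of Python: because the genetic code is degenerate, one amino acid sequence can be back-translated into many distinct nucleotide sequences that all carry the same label. This is a minimal illustration, not the code in run_nta. The function names (back_translate, translate, augment) are hypothetical, and uniform random codon sampling under the standard genetic code is an assumption.

```python
import itertools
import random

# Standard genetic code, codons enumerated in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO_ACIDS)
}

# Invert the table: amino acid -> list of synonymous codons.
AA_TO_CODONS = {}
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODONS.setdefault(aa, []).append(codon)


def back_translate(protein, rng):
    """Pick one synonymous codon per residue, uniformly at random (assumption)."""
    return "".join(rng.choice(AA_TO_CODONS[aa]) for aa in protein)


def translate(dna):
    """Translate a nucleotide sequence back to amino acids, codon by codon."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna), 3))


def augment(protein, n, seed=0, max_tries=1000):
    """Return up to n distinct nucleotide sequences encoding `protein`."""
    rng = random.Random(seed)
    variants = set()
    tries = 0
    while len(variants) < n and tries < max_tries:
        variants.add(back_translate(protein, rng))
        tries += 1
    return sorted(variants)


# Hypothetical usage: every variant translates back to the same protein.
for variant in augment("MKV", n=3):
    print(variant, translate(variant))
```

Each variant would inherit the label of the amino acid sequence it was generated from. A sequence made only of single-codon residues (for example M and W) has just one possible encoding, so augment can return fewer than n variants.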
The full pipeline to reproduce the study, written in Python, can be summarised into three consecutive steps:
- Data preprocessing and Nucleotide Augmentation (NTA).
- Model training and evaluation.
- Plot results.
Unzip data/, then perform preprocessing by running the following commands from the main folder:
cd preprocessing
./preprocessing.sh
This will execute train/val/test splitting and NTA for the GB1, AAV, and Trastuzumab data sets and save the resulting subsets in separate .CSV files in their respective subfolders under data/.
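As a rough picture of what such a splitting step does, the sketch below shuffles labeled rows, divides them into train/val/test subsets, and writes one CSV file per subset. It is not the logic of preprocessing.sh. The split fractions, the column names (aa_seq, label), the output file names, and the helper names split_rows and write_splits are all assumptions made for illustration.

```python
import csv
import random
import tempfile
from pathlib import Path


def split_rows(rows, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle rows reproducibly and cut them into train/val/test subsets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    return {
        "test": shuffled[:n_test],
        "val": shuffled[n_test:n_test + n_val],
        "train": shuffled[n_test + n_val:],
    }


def write_splits(splits, out_dir, fieldnames):
    """Write each subset to its own CSV file and return the file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for name, subset in splits.items():
        path = out_dir / f"{name}.csv"
        with path.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(subset)
        paths[name] = path
    return paths


# Hypothetical toy data set with an amino acid sequence and a binary label.
rows = [{"aa_seq": f"SEQ{i}", "label": i % 2} for i in range(20)]
splits = split_rows(rows, val_frac=0.1, test_frac=0.2, seed=1)
with tempfile.TemporaryDirectory() as tmp:
    paths = write_splits(splits, tmp, fieldnames=["aa_seq", "label"])
    print({name: len(subset) for name, subset in splits.items()})
```

Fixing the seed keeps the subsets reproducible across runs, which matters when results from separate training runs are compared.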
Model training and evaluation are performed for each data set by executing the following commands from the main folder:
cd scripts
./train_eval_gb1.sh
./train_eval_aav.sh
./train_eval_trastuzumab.sh
This will populate the folder results/ with .CSV files containing the training and evaluation results for each data set, in the appropriate format for plotting in Step 3.
Plotting is performed by running the following commands from the main folder:
cd plot
python plot_gb1.py
python plot_aav.py
python plot_trastuzumab.py
If you use the code in this repository for your research, please cite our paper.
@article{10.1093/bioadv/vbac094,
author = {Minot, Mason and Reddy, Sai T},
title = "{Nucleotide augmentation for machine learning-guided protein engineering}",
journal = {Bioinformatics Advances},
year = {2022},
month = {12},
issn = {2635-0041},
doi = {10.1093/bioadv/vbac094},
url = {https://doi.org/10.1093/bioadv/vbac094},
note = {vbac094},
eprint = {https://academic.oup.com/bioinformaticsadvances/advance-article-pdf/doi/10.1093/bioadv/vbac094/47762525/vbac094.pdf},
}