The packages needed to run the benchmarking can be installed from the environment.yml file. After creating a Conda environment from that file, install the following packages in R:
install.packages(c('dplyr', 'pbapply', 'lmerTest', 'parallel', 'lme4', 'plyr', 'optparse', 'logging', 'data.table', 'ggplot2', 'grid', 'pheatmap'))
install.packages(c("pkgmaker", "stringi", "doParallel", "SimSeq", "tidyr", "devtools", "TcGSA", "MCMCprecision")) # Come back to install devtools if necessary
install.packages("BiocManager")
BiocManager::install(c("phyloseq", "microbiome", "SparseDOSSA2", "ALDEx2", "ANCOMBC", "TreeSummarizedExperiment", "Maaslin2", "DESeq2", "edgeR"))
library("devtools")
install_github("biobakery/MaAsLin3")
Each evaluation folder (community_shift, deep_sequencing, general_evaluations, groups, omps, and unscaled) is structured in the same way. Each evaluation has:
- A data_generation folder with scripts to generate data according to SparseDOSSA2 or ANCOM-BC's generator
- A run_tools folder with scripts to run each differential abundance tool on the generated datasets
- An evaluate_tools folder with scripts to compute accuracy metrics for each tool and produce plots
- A .py file with a workflow to generate the simulated data for all sets of parameters and run the differential abundance tools on all generated datasets
- A .txt file with the set of generators to use
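The layout described above can be sanity-checked before launching a run. The following is a minimal sketch and not part of the repository: the helper name `missing_parts` is hypothetical, and it assumes the workflow (.py) and parameter (.txt) files sit directly inside each evaluation folder, as the commands in this README suggest.

```python
from pathlib import Path

# Sub-folders that every evaluation folder is described as containing.
EXPECTED_SUBFOLDERS = ("data_generation", "run_tools", "evaluate_tools")


def missing_parts(evaluation_dir):
    """Return the expected items that are absent from evaluation_dir.

    Hypothetical helper: looks for the three sub-folders plus at least one
    .py workflow file and one .txt file at the top level of the folder.
    """
    root = Path(evaluation_dir)
    missing = [name for name in EXPECTED_SUBFOLDERS if not (root / name).is_dir()]
    if not any(root.glob("*.py")):
        missing.append("*.py")
    if not any(root.glob("*.txt")):
        missing.append("*.txt")
    return missing


if __name__ == "__main__":
    # Run from the repository root; prints "ok" or the missing items per folder.
    for folder in ("community_shift", "deep_sequencing", "general_evaluations",
                   "groups", "omps", "unscaled"):
        print(folder, missing_parts(folder) or "ok")
```

An empty list means the folder matches the described structure; anything else names what appears to be missing.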
The library folder contains general-purpose functions shared across evaluation types, such as functions for data generation and accuracy evaluation.
The following command runs the community shift evaluations.
python community_shift/evaluate_community_shift.py \
--parameters community_shift/evaluate_community_shift.txt \
-o community_shift/
The following command runs the deep sequencing evaluations.
python deep_sequencing/evaluate_deep_sequencing.py \
--parameters deep_sequencing/evaluate_deep_sequencing.txt \
-o deep_sequencing/
The following command runs the ANCOM-BC evaluation.
python general_evaluations/evaluate_general.py \
--parameters general_evaluations/evaluate_general.txt \
-o general_evaluations/
The following command runs the groupwise difference evaluations.
python groups/evaluate_groups.py \
--parameters groups/evaluate_groups.txt \
-o groups/
The following command runs the ordered predictor evaluations.
python omps/evaluate_omps.py \
--parameters omps/evaluate_omps.txt \
-o omps/
The following command runs the spike-in abundance evaluations.
python unscaled/evaluate_unscaled.py \
--parameters unscaled/evaluate_unscaled.txt \
-o unscaled/
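The six commands above follow one pattern, so they can be generated from a small table. The sketch below is an illustration rather than a script shipped with the repository: `EVALUATIONS`, `build_command`, and `run_all` are hypothetical names, and running the evaluations one after another from the repository root is an assumption. By default it only prints the commands.

```python
import subprocess
import sys

# Evaluation folder -> shared stem of its workflow and parameter files,
# copied from the commands listed in this README.
EVALUATIONS = {
    "community_shift": "evaluate_community_shift",
    "deep_sequencing": "evaluate_deep_sequencing",
    "general_evaluations": "evaluate_general",
    "groups": "evaluate_groups",
    "omps": "evaluate_omps",
    "unscaled": "evaluate_unscaled",
}


def build_command(folder, python=sys.executable):
    """Build the argument list for one evaluation workflow."""
    stem = EVALUATIONS[folder]
    return [
        python, f"{folder}/{stem}.py",
        "--parameters", f"{folder}/{stem}.txt",
        "-o", f"{folder}/",
    ]


def run_all(dry_run=True):
    """Print each command; execute it as well when dry_run is False."""
    for folder in EVALUATIONS:
        cmd = build_command(folder)
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first workflow that exits with an error.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

Calling `run_all(dry_run=False)` would execute the workflows in sequence; keeping the dry run as the default makes it easy to inspect the commands first.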
The randomization_test directory is structured similarly. It contains:
- A run_tools folder with scripts to run the randomized or non-randomized datasets
- An evaluate_tools folder with a script to compute accuracy metrics for each tool and produce plots
- A .py file with a workflow to generate the shuffled data and run the differential abundance tools on all generated datasets
The following command runs the randomization test evaluations.
python randomization_test/evaluate_randomization.py \
-o randomization_test/
The real_data_absolute_abundance directory contains three sub-directories, one for each dataset. Each contains:
- A data folder with the abundance data and metadata. These data were obtained from the supplementary information of each study or from the ENA nucleotide browser's display of per-sample metadata including read depth.
- A results folder with the script outputs.
- A run_scripts folder with scripts to run the differential abundance tools on each dataset.
There is also a join_results.R script to combine the results and create plots.
The scripts folder contains the script to perform the MetaPhlAn analysis of the HMP2 data. The analysis folder contains the following:
- A data folder with the taxonomic profiles, metatranscriptomics profiles, and patient metadata. The pathabundances_3 files are downloaded from https://www.ibdmdb.org/.
- A results folder with outputs from the differential abundance tools
- A run_scripts folder with scripts to run each differential abundance tool
- An age_associations.py script to run MaAsLin 3 for the pediatric and adult IBD cohorts
- A diet_associations.py script to run MaAsLin 3 for diet associations
- An ibd_associations.py script to run all differential abundance tools
- An mtx_associations.py script to run all metatranscriptomics analyses
- An analyze_results.R script to compile the taxonomic abundance results and create figures
- An opposite_associations.R script to show an example of opposite abundance and prevalence associations from HMP2
The following command performs the MetaPhlAn analysis.
python run_mpa.py -i data/hmp2_qc/ \
-o maaslin3_benchmark/HMP2/outputs \
--input-extension fastq.gz --bowtie metaphlan4/
The following commands run the HMP2 analysis.
python age_associations.py \
-o maaslin3_benchmark/HMP2/analysis_age/ \
--workingDirectory $(pwd)
python ibd_associations.py \
-o maaslin3_benchmark/HMP2/analysis/ \
--workingDirectory $(pwd)
python diet_associations.py \
-o maaslin3_benchmark/HMP2/analysis_diet/ \
--workingDirectory $(pwd)
python mtx_associations.py \
-o maaslin3_benchmark/HMP2/analysis_mtx/ \
--workingDirectory $(pwd)
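The four HMP2 commands differ only in the script name and output directory. As a hedged sketch, they could be assembled as follows; `HMP2_ANALYSES` and `hmp2_commands` are hypothetical names, and the current directory stands in for `$(pwd)` on the assumption that the commands are issued from the folder containing the scripts.

```python
import os

# Script -> output directory, copied from the commands above.
HMP2_ANALYSES = {
    "age_associations.py": "maaslin3_benchmark/HMP2/analysis_age/",
    "ibd_associations.py": "maaslin3_benchmark/HMP2/analysis/",
    "diet_associations.py": "maaslin3_benchmark/HMP2/analysis_diet/",
    "mtx_associations.py": "maaslin3_benchmark/HMP2/analysis_mtx/",
}


def hmp2_commands(working_directory=None):
    """Return one argument list per HMP2 analysis script."""
    # Fall back to the current directory, mirroring $(pwd) in the shell form.
    wd = working_directory or os.getcwd()
    return [
        ["python", script, "-o", out_dir, "--workingDirectory", wd]
        for script, out_dir in HMP2_ANALYSES.items()
    ]


if __name__ == "__main__":
    for cmd in hmp2_commands():
        print(" ".join(cmd))
```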