Genome Evaluation Pipeline (v2)
This repository contains a significantly updated version of GEP that builds upon lessons learned from ERGA and GAME.
Data is entered via a simple table, and configuration is managed through a tidy control panel. GEP2 uses a modern Snakemake version with containers and can run on a server/cluster (SLURM) or a local computer.
Please cite: Genome Evaluation Pipeline (GEP): A fully-automated quality control tool for parallel evaluation of genome assemblies. https://doi.org/10.1093/bioadv/vbaf147
- conda
- apptainer
GEP2 is adding features rapidly, so please download the latest release (or clone the repo to get hot fixes faster).
The environment contains the Snakemake packages and NomNom. Enter the GEP2 folder and run:
conda env create -f install.yml
You can build the data table in Google Drive, Excel, LibreOffice, or Numbers, or as a plain CSV/TSV file.
The table should contain these columns:
| sp_name | asm_id | skip | asm_files | read_type | read_files |
|---|---|---|---|---|---|
Please see the example table. The easiest approach is to make a copy of that Google table (File -> Make a copy) and replace the fields with your data. Remember to change the permissions (Share -> change General access to "Anyone with the link", Viewer).
- sp_name: Species name in binomial nomenclature (e.g., `Vultur gryphus`)
- asm_id: Assembly identifier (e.g., `hifiasm_l2`, `yahs_test`, or `ASM2260516v1`)
- skip: Flag assemblies for selective analysis skipping. Leave empty (or `-` or `off`) to run all analyses. Set to `on` to flag this assembly, then control which analyses to skip in `control_panel.yaml` using `SKIP_KMER`, `SKIP_INSP`, `SKIP_HIC`, etc. Useful for running quick QC on draft assemblies while running the full analysis on final assemblies.
- asm_files: Path to assembly file, URL, or accession number (e.g., `GCA_022605165.1`). If it is a link or an accession, the pipeline downloads the data automatically. If Pri/Alt (or Hap1/Hap2) assemblies are available, add them comma-separated, like: `GCA_963854735.1, GCA_963694935.1`
- read_type: Can be `illumina`, `10x`, `hifi`, or `ont` (variations like `PacBio`, `paired-end`, `linked-read`, `arima`, `promethion`, and others should also work fine)
- read_files: Comma-separated list of paths to read files. Can also be accession numbers (e.g., `ERR12205285,ERR12205286`). For paired-end reads, list them as: `forward1,reverse1,forward2,reverse2`. Pattern expansion in paths also works, like `/readsA/*.fq.gz, /readsB/*.fq.gz`
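For a table kept as a local file, the column layout above can be sketched as a small TSV. This is only an illustration of the layout: the file name `gep2_table.tsv` and every path in it are hypothetical placeholders, and the two rows are not a real dataset. Tabs are used as the delimiter here so that the comma-separated lists in `asm_files` and `read_files` stay inside a single cell.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical file name and placeholder paths; replace with your own data.
table="gep2_table.tsv"

# One header line plus two example rows, all tab-separated.
# Row 1 is flagged with skip=on (a draft); row 2 leaves skip empty (run everything).
{
  printf 'sp_name\tasm_id\tskip\tasm_files\tread_type\tread_files\n'
  printf 'Vultur gryphus\thifiasm_l2\ton\t/path/to/draft.fa.gz\thifi\t/path/to/hifi/*.fq.gz\n'
  printf 'Vultur gryphus\tyahs_test\t\t/path/to/hap1.fa.gz,/path/to/hap2.fa.gz\tillumina\tfwd1.fq.gz,rev1.fq.gz,fwd2.fq.gz,rev2.fq.gz\n'
} > "$table"

# Sanity check: every line must have exactly six tab-separated fields.
awk -F'\t' 'NF != 6 { printf "line %d has %d fields\n", NR, NF; bad = 1 } END { exit bad }' "$table"
echo "OK: $(($(wc -l < "$table"))) lines, 6 columns each"
```

The `awk` check is a quick way to catch a stray or missing tab before pointing the control panel at the file; it does not validate the contents of the cells.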
Add the table path/address and select different options in:
config/control_panel.yaml
GEP2/execution/
├── local/
│   └── config.yaml
└── slurm/
    └── config.yaml
IMPORTANT: You can tweak the per-tool resource limits in GEP2/config/resources.yaml
The first run takes longer because the containers need to be built.
Activate the conda environment (`conda activate GEP2_env`) and, from the GEP2 folder, run:
On HPC/Server/Cluster using Slurm:
nohup snakemake --profile execution/slurm &
On a local computer:
nohup snakemake --profile execution/local &
- `nohup` runs Snakemake in a way that won't be interrupted if you lose connection to the server/cluster
- The trailing `&` runs the command in the background, allowing you to continue using the terminal
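When its output is not redirected, `nohup` typically appends it to a file called `nohup.out` in the current folder. The sketch below shows the same background pattern with the output sent to a named log instead. The log name `gep2_run.log` is a hypothetical choice, and a harmless `sh -c` command stands in for the `snakemake` call so the example runs anywhere.

```shell
#!/usr/bin/env bash
set -euo pipefail

log="gep2_run.log"   # hypothetical log file name

# Stand-in for `snakemake --profile execution/slurm`.
# Both stdout and stderr go to the log; `&` puts the job in the background.
nohup sh -c 'echo "pipeline started"; echo "pipeline finished"' > "$log" 2>&1 &
pid=$!
echo "background PID: $pid"

# In a real run you could log out here; this sketch simply waits for the job.
wait "$pid"
tail -n 1 "$log"
```

During a real run, `tail -f` on the log file is a convenient way to follow progress after reconnecting to the server.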
Before running the full pipeline, perform a dry run to check what will execute and catch any errors:
snakemake --profile execution/slurm --dry-run
You can also inspect:
- GEP2_results/data_config.yaml
- GEP2_results/download_manifest.json
Open the report with a markdown renderer (VS Code works well):
GEP2_results/{sp_name}/{asm_id}/{asm_id}_report.md
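With several species and assemblies in one table, it can help to list every report at once. The following is a minimal sketch that relies only on the path pattern above; the function name `list_reports` is made up for this example and is not part of GEP2.

```shell
#!/usr/bin/env bash
set -euo pipefail

# List all assembly reports under a results folder (default: GEP2_results).
# Reports sit exactly three levels down: {sp_name}/{asm_id}/{asm_id}_report.md
list_reports() {
  local results="${1:-GEP2_results}"
  if [ -d "$results" ]; then
    find "$results" -mindepth 3 -maxdepth 3 -type f -name '*_report.md' | sort
  else
    echo "No results folder at $results yet" >&2
  fi
}

list_reports "$@"
```

Restricting the search depth keeps `find` from descending into the much larger `data/` and `downloaded_data/` folders.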
GEP2_results/
├── data/
│   └── {sp_name}/
│       └── reads/
│           ├── {read_type}/
│           │   ├── {read_symlink}
│           │   ├── kmer_db_k{k-mer_length}/
│           │   │   └── {read_name}.meryl
│           │   ├── logs/
│           │   └── processed/
│           │       ├── {read_type}_Path{number}_{read_name}_{process}.fq.gz
│           │       └── reports/
│           │           └── multiqc_report.html
│           └── ...
├── data_config.yaml
├── data_table_{hash}.csv
├── downloaded_data/
│   └── {sp_name}/
│       ├── assemblies/
│       │   └── {asm_file}
│       └── reads/
│           └── {read_type}/
│               └── {read_file}
├── download_manifest.json
└── {sp_name}/
    └── {asm_id}/
        ├── {asm_id}_report.md
        ├── compleasm/
        │   └── {asm_file_name}/
        │       ├── {asm_file_name}_results.tar.gz
        │       └── {asm_file_name}_summary.txt
        ├── gfastats/
        │   └── {asm_file_name}_stats.txt
        ├── hic/
        │   └── {asm_file_name}/
        │       ├── {asm_file_name}.cool
        │       ├── {asm_file_name}.mcool
        │       ├── {asm_file_name}.pairs.gz
        │       ├── {asm_file_name}.pairtools_stats.txt
        │       ├── {asm_file_name}.pretext
        │       ├── {asm_file_name}_tracks.pretext
        │       ├── {asm_file_name}_snapshots
        │       │   └── {asm_file_name}_FullMap.png
        │       └── tracks
        │           └── ...bedgraph
        ├── inspector/
        │   └── {asm_file_name}/
        │       ├── ..
        │       └── summary_statistics
        ├── k{k-mer_length}/
        │   ├── {asm_id}.hist
        │   ├── {asm_id}.meryl
        │   └── genomescope2/
        │       └── {asm_id}_linear_plot.png
        ├── logs/
        └── merqury/
            ├── ..
            ├── {asm_file_name}.completeness.stats
            ├── {asm_file_name}.qv
            └── ...png
| tool | doi |
|---|---|
| blobtools | - |
| chromap | 10.1038/s41467-021-26865-w |
| cooler | 10.1093/bioinformatics/btz540 |
| compleasm | 10.1093/bioinformatics/btad595 |
| diamond | 10.1038/s41592-021-01101-x |
| enabrowsertools | - |
| fastp | 10.1093/bioinformatics/bty560 |
| fastqc | - |
| fcs-gx | 10.1186/s13059-024-03198-7 |
| genomescope2 | 10.1038/s41467-020-14998-3 |
| gfastats | 10.1093/bioinformatics/btac460 |
| hifiadapterfilt | 10.1186/s12864-022-08375-1 |
| inspector | 10.1186/s13059-021-02527-4 |
| longdust | - |
| merqury | 10.1186/s13059-020-02134-9 |
| minimap | 10.1093/bioinformatics/bty191 |
| multiqc | 10.1093/bioinformatics/btw354 |
| nanoplot | 10.1093/bioinformatics/btad311 |
| pairtools | 10.1101/2023.02.13.528389 |
| pretextmap | - |
| sambamba | 10.1093/bioinformatics/btv098 |
| samtools | 10.1093/gigascience/giab008 |
| sdust | - |
| tidk | 10.1093/bioinformatics/btaf049 |