A fully-automated quality control tool for parallel evaluation of genome assemblies

GEP2

Genome Evaluation Pipeline (v2)

This repository contains a significantly updated version of GEP that builds upon lessons learned from ERGA and GAME.

Data is entered via a simple table, and configuration is managed through a tidy control panel. GEP2 uses a modern Snakemake version with containers and can run on a server/cluster (SLURM) or a local computer.

Please cite: Genome Evaluation Pipeline (GEP): A fully-automated quality control tool for parallel evaluation of genome assemblies. https://doi.org/10.1093/bioadv/vbaf147


Requirements

  • conda
  • apptainer
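
Before going further, it can help to confirm that both requirements are on your PATH. The snippet below is a minimal sketch; `missing_tools` is a hypothetical helper name used only for illustration and is not part of GEP2.

```shell
# Hypothetical helper (not part of GEP2): print each requested tool that is
# not found on PATH, and return non-zero if any is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" > /dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# GEP2 lists conda and apptainer as its requirements.
if missing="$(missing_tools conda apptainer)"; then
    echo "All requirements found."
else
    echo "Missing: $(echo "$missing" | tr '\n' ' ')"
fi
```

If something is reported as missing, install it (or load the matching module on your cluster) before creating the environment.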

How to Get and Set Up GEP2

1) Get the latest version

GEP2 is gaining features rapidly, so please download the latest release (or clone the repo to get hot fixes faster).

2) Create the GEP2 Conda Environment

The environment contains the Snakemake packages and NomNom. Enter the GEP2 folder and run:

conda env create -f install.yml

3) Enter Your Data in a Table

You can use Google Drive, Excel, LibreOffice, Numbers, CSV, TSV, etc.

The table should contain these columns:

| sp_name | asm_id | skip | asm_files | read_type | read_files |

Please see the example table. The easiest approach is to make a copy of that Google table (File -> Make a copy) and replace the fields with your data. Remember to change the permissions (Share -> change General access to "Anyone with the link", Viewer).

Column Descriptions:

  • sp_name: Species name in binomial nomenclature (e.g., Vultur gryphus)
  • asm_id: Assembly identifier (e.g., hifiasm_l2, yahs_test, or ASM2260516v1)
  • skip: Flag assemblies for selective analysis skipping. Leave empty (or - or off) to run all analyses. Set to on to flag this assembly, then control which analyses to skip in control_panel.yaml using SKIP_KMER, SKIP_INSP, SKIP_HIC, etc. Useful for running quick QC on draft assemblies while running full analysis on final assemblies.
  • asm_files: Path to assembly file, URL, or accession number (e.g., GCA_022605165.1). If it's a link or accession, the pipeline will download the data automatically. If Pri/Alt (or Hap1/Hap2) assemblies are available, add them as a comma-separated list, like: GCA_963854735.1, GCA_963694935.1
  • read_type: Can be illumina, 10x, hifi, or ont (variations like PacBio, paired-end, linked-read, arima, promethion and others should also work fine)
  • read_files: Comma-separated list of paths to read files. Accession numbers also work (e.g., ERR12205285,ERR12205286). For paired-end reads, list them as: forward1,reverse1,forward2,reverse2. You can also use pattern expansion in paths, like /readsA/*.fq.gz, /readsB/*.fq.gz
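
If you prefer a local file over a Google table, the six columns can be written as a TSV from the shell. The sketch below is illustrative only: the file name `gep2_table.tsv` is an assumption, and the row simply combines example values quoted in the column descriptions, so it is not a verified dataset.

```shell
# Write a one-row example table as TSV. The file name is hypothetical, and the
# row reuses example values from the column descriptions for illustration only.
table="gep2_table.tsv"
{
    printf 'sp_name\tasm_id\tskip\tasm_files\tread_type\tread_files\n'
    printf 'Vultur gryphus\tASM2260516v1\t-\tGCA_022605165.1\thifi\tERR12205285,ERR12205286\n'
} > "$table"

# Sanity check: every line must have exactly six tab-separated fields.
bad_lines="$(awk -F '\t' 'NF != 6 { print NR }' "$table")"
if [ -z "$bad_lines" ]; then
    echo "OK: $table has six columns on every line"
else
    echo "Wrong column count on line(s): $bad_lines" >&2
fi
```

A column-count check like this catches the most common table mistake, a stray or missing tab, before the pipeline reads the file.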

4) Configure the Control Panel

Add the table path/address and select different options in:

config/control_panel.yaml

5) Configure Cluster or Computer Parameters

GEP2/execution/
├── local/
│   └── config.yaml
└── slurm/
    └── config.yaml

IMPORTANT: You can tweak the per-tool resource bounds in GEP2/config/resources.yaml

6) Run!

The first run takes longer because the containers need to be built.

Activate the conda environment with conda activate GEP2_env, then run the following from the GEP2 folder:

On HPC/Server/Cluster using Slurm:

nohup snakemake --profile execution/slurm &

On Local Computer:

nohup snakemake --profile execution/local &

About the Command:

  • nohup runs Snakemake in a way that won't be interrupted if you lose connection to the server/cluster
  • The trailing & runs the command in the background, allowing you to continue using the terminal

Dry Run (Recommended):

Before running the full pipeline, perform a dry run to check what will execute and catch any errors:

snakemake --profile execution/slurm --dry-run

You can also inspect:

  • GEP2_results/data_config.yaml
  • GEP2_results/download_manifest.json
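
The dry-run-then-run sequence described above can be wrapped in a small shell function so the real run only starts when the dry run succeeds. This is a sketch under stated assumptions: `gep2_checked_run`, the `SNAKEMAKE_BIN` override, and the log file names `dry_run.log` and `gep2_run.log` are all hypothetical and not part of GEP2. Only the `--profile` and `--dry-run` options shown in this README are used.

```shell
# Hypothetical wrapper (not part of GEP2): do a dry run first and start the
# real run in the background only if the dry run exits successfully.
# SNAKEMAKE_BIN is an assumed override that defaults to "snakemake".
gep2_checked_run() {
    profile="$1"
    smk="${SNAKEMAKE_BIN:-snakemake}"
    if "$smk" --profile "$profile" --dry-run > dry_run.log 2>&1; then
        # Same nohup ... & pattern as above, with output kept in a log file.
        nohup "$smk" --profile "$profile" > gep2_run.log 2>&1 &
        echo "started (PID $!); follow progress with: tail -f gep2_run.log"
    else
        echo "dry run failed; see dry_run.log" >&2
        return 1
    fi
}

# Usage, from the GEP2 folder with the conda environment active:
#   gep2_checked_run execution/slurm
```

Redirecting the output explicitly keeps the dry-run and real-run messages in separate files instead of relying on the default nohup.out.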

Results Structure

Open the report with a markdown renderer (VS Code works well):

GEP2_results/{sp_name}/{asm_id}/{asm_id}_report.md

Directory Structure:

GEP2_results/
├── data/
│   └── {sp_name}/
│       └── reads/
│           ├── {read_type}/
│           │   ├── {read_symlink}
│           │   ├── kmer_db_k{k-mer_length}/
│           │   │   └── {read_name}.meryl
│           │   ├── logs/
│           │   └── processed/
│           │       ├── {read_type}_Path{number}_{read_name}_{process}.fq.gz
│           │       └── reports/
│           │           └── multiqc_report.html
│           └── ...
├── data_config.yaml
├── data_table_{hash}.csv
├── downloaded_data/
│   └── {sp_name}/
│       ├── assemblies/
│       │   └── {asm_file}
│       └── reads/
│           └── {read_type}/
│               └── {read_file}
├── download_manifest.json
└── {sp_name}/
    └── {asm_id}/
        ├── {asm_id}_report.md
        ├── compleasm/
        │   └── {asm_file_name}/
        │       ├── {asm_file_name}_results.tar.gz
        │       └── {asm_file_name}_summary.txt
        ├── gfastats/
        │   └── {asm_file_name}_stats.txt
        ├── hic/
        │   └── {asm_file_name}/
        │       ├── {asm_file_name}.cool
        │       ├── {asm_file_name}.mcool
        │       ├── {asm_file_name}.pairs.gz
        │       ├── {asm_file_name}.pairtools_stats.txt
        │       ├── {asm_file_name}.pretext
        │       ├── {asm_file_name}_tracks.pretext
        │       ├── {asm_file_name}_snapshots
        │       │   └── {asm_file_name}_FullMap.png
        │       └── tracks
        │           └── ...bedgraph
        ├── inspector/
        │   └── {asm_file_name}/
        │       ├── ..
        │       └── summary_statistics
        ├── k{k-mer_length}/
        │   ├── {asm_id}.hist
        │   ├── {asm_id}.meryl
        │   └── genomescope2/
        │       └── {asm_id}_linear_plot.png
        ├── logs/
        └── merqury/
            ├── ..
            ├── {asm_file_name}.completeness.stats
            ├── {asm_file_name}.qv
            └── ...png
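
With several species and assemblies in one run, it can be handy to list all reports at once. The function below is a minimal sketch that relies only on the layout shown above (GEP2_results/{sp_name}/{asm_id}/{asm_id}_report.md); `list_reports` is a hypothetical helper name, not a GEP2 command.

```shell
# Hypothetical helper (not part of GEP2): list every per-assembly report,
# assuming the layout {results}/{sp_name}/{asm_id}/{asm_id}_report.md.
list_reports() {
    results_dir="${1:-GEP2_results}"
    # Reports sit exactly three levels below the results folder.
    find "$results_dir" -mindepth 3 -maxdepth 3 -type f -name '*_report.md' | sort
}

# Example: print the reports of a finished run, if the folder exists.
if [ -d GEP2_results ]; then
    list_reports GEP2_results
fi
```

Restricting the search depth keeps files under data/ and downloaded_data/ out of the listing.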

Main tools:

| tool | doi |
|---|---|
| blobtools | - |
| chromap | 10.1038/s41467-021-26865-w |
| cooler | 10.1093/bioinformatics/btz540 |
| compleasm | 10.1093/bioinformatics/btad595 |
| diamond | 10.1038/s41592-021-01101-x |
| enabrowsertools | - |
| fastp | 10.1093/bioinformatics/bty560 |
| fastqc | - |
| fcs-gx | 10.1186/s13059-024-03198-7 |
| genomescope2 | 10.1038/s41467-020-14998-3 |
| gfastats | 10.1093/bioinformatics/btac460 |
| hifiadapterfilt | 10.1186/s12864-022-08375-1 |
| inspector | 10.1186/s13059-021-02527-4 |
| longdust | - |
| merqury | 10.1186/s13059-020-02134-9 |
| minimap | 10.1093/bioinformatics/bty191 |
| multiqc | 10.1093/bioinformatics/btw354 |
| nanoplot | 10.1093/bioinformatics/btad311 |
| pairtools | 10.1101/2023.02.13.528389 |
| pretextmap | - |
| sambamba | 10.1093/bioinformatics/btv098 |
| samtools | 10.1093/gigascience/giab008 |
| sdust | - |
| tidk | 10.1093/bioinformatics/btaf049 |
