A pipeline developed in collaboration with Exeter University
Nextflow pipelines require a few prerequisites. Further documentation on how to install Nextflow is available on the nf-core website.
- Docker or Singularity.
- Java and OpenJDK >= 8 (Please note: when installed, Java versions are reported as 1.VERSION, so Java 8 is reported as Java 1.8).
- Nextflow >= v24.04.1.
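As a quick check that the prerequisites are in place, you can confirm the installed versions from the command line (a minimal sketch; swap docker for singularity if that is your container engine):
java -version        # should report 1.8 or higher (i.e. Java 8+)
nextflow -version    # should report version 24.04.1 or higher
docker --version     # or: singularity --version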
To install the pipeline, use one of the following commands, replacing VERSION with a release tag.
wget https://github.com/Eco-Flow/pollen-metabarcoding/archive/refs/tags/VERSION.tar.gz -O - | tar -xvf -
or
curl -L https://github.com/Eco-Flow/pollen-metabarcoding/archive/refs/tags/VERSION.tar.gz --output - | tar -xvf -
This will produce a directory in the current directory called pollen-metabarcoding-VERSION
which contains the pipeline.
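For example, with a hypothetical release tag v1.0.0 (check the repository's releases page for the actual tags):
wget https://github.com/Eco-Flow/pollen-metabarcoding/archive/refs/tags/v1.0.0.tar.gz -O - | tar -xvf -
cd pollen-metabarcoding-v1.0.0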
--input - Path to a comma-separated file containing a sample ID and either path(s) to the fastq(s) or an SRA ID. Each row contains information on a single sample.
--database - Path to the database fasta file to be used in the vsearch sintax module.
--outdir - Path to the output directory where the results will be saved (you must use absolute paths to storage on cloud infrastructure) [default: results].
--FW_primer - Sequence of the forward primer.
--RV_primer - Sequence of the reverse primer.
--single_end - Tells the pipeline whether to expect single-end or paired-end SRA data [default: false].
--custom_config - A path/URL to a custom configuration file.
--publish_dir_mode - Method used to save pipeline results to the output directory (accepted: symlink, rellink, link, copy, copyNoFollow, move) [default: copy].
--clean - Enable the cleanup function [default: false].
--forks - Maximum number of instances of each process that will be run in parallel.
--max_cpus - Maximum number of CPUs that can be requested for any single job [default: 16].
--max_memory - Maximum amount of memory that can be requested for any single job [default: 128.GB].
--max_time - Maximum amount of time that can be requested for any single job [default: 48.h].
--help - Display help text.
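For orientation, a minimal invocation supplying only the core parameters might look like this (the paths are illustrative; the primer sequences are taken from the examples further down):
nextflow run main.nf -profile docker \
    --input /path/to/samples.csv \
    --database /path/to/database.sintax.fa \
    --FW_primer "ATGCGATACTTGGTGTGAAT" \
    --RV_primer "GCATATCAATAAGCGGAGGA"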
--cutadapt_min_overlap - Cutadapt minimum overlap parameter [default: 3].
--cutadapt_max_error_rate - Cutadapt maximum error rate parameter [default: 0.1].
--pacbio - Cutadapt PacBio parameter.
--iontorrent - Cutadapt IonTorrent parameter.
--pear_p_value - PEAR p-value parameter.
--pear_min_overlap - PEAR minimum overlap parameter.
--pear_max_len - PEAR maximum length parameter.
--pear_min_len - PEAR minimum length parameter.
--pear_trimmed_min_len - PEAR trimmed minimum length parameter.
--pear_quality - PEAR quality score threshold parameter.
--pear_max_uncalled - PEAR maximum percentage of uncalled bases parameter.
--pear_stat_test - PEAR statistical test parameter.
--pear_scoring_method - PEAR scoring method parameter.
--pear_phred - PEAR Phred score threshold parameter.
--fastq_maxee - vsearch maximum expected errors parameter.
--fastq_minlen - vsearch minimum fastq length parameter.
--fastq_maxns - vsearch maximum Ns parameter.
--fasta_width - vsearch fasta width parameter.
--minuniquesize - vsearch minimum unique size parameter [default: 2].
--derep_strand - vsearch dereplication strand parameter.
--sintax_cutoff - vsearch sintax cutoff parameter.
--sintax_strand - vsearch sintax strand parameter.
--seed - vsearch sintax random seed parameter [default: 1312].
--ncbi_settings - Path to the NCBI settings folder.
--certificate - Path to a certificate file.
--awsqueue - AWS Batch queue to use.
--awsregion - AWS region to use with AWS Batch.
--awscli - Path to the AWS CLI installation on the host instance.
--s3bucket - S3 bucket path to use as the work directory.
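A sketch of how these options fit together when launching on AWS Batch (the queue name, region and bucket are illustrative placeholders):
nextflow run main.nf -profile docker,aws_batch \
    --awsqueue "my-batch-queue" \
    --awsregion "eu-west-2" \
    --s3bucket "s3://my-bucket/work" \
    --input data/input_full-s3.csv \
    --database "s3://pollen-metabarcoding-test-data/data/viridiplantae_all_2014.sintax.fa" \
    --FW_primer "ATGCGATACTTGGTGTGAAT" --RV_primer "GCATATCAATAAGCGGAGGA"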
To create an input samplesheet in the correct format, you can use the Python script fastq_dir_to_samplesheet.py, which has been adapted from an nf-core rnaseq script.
You can use it when all your fastq files are in a single folder and end with _R1_001.fastq.gz and/or _R2_001.fastq.gz.
Usage:
python3 fastq_dir_to_samplesheet.py /path/to/fastq/files Input.csv
- Where Input.csv is the name you want to give the samplesheet.
- And /path/to/fastq/files is the full path to the folder containing your fastq data.
- If read 1 and read 2 have variant endings, you can specify them with the --read1_extension and --read2_extension flags. The default is "_R1_001.fastq.gz" (see the example below).
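For instance, if your reads end in _1.fastq.gz and _2.fastq.gz instead (these extensions are illustrative), the call might look like:
python3 fastq_dir_to_samplesheet.py /path/to/fastq/files Input.csv \
    --read1_extension "_1.fastq.gz" \
    --read2_extension "_2.fastq.gz"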
Once completed, your output directory should be called results (unless you specified another name) and should contain the following directory structure:
results
├── cut_tsvs
├── cutadapt
│ ├── fastqs
│ └── logs
├── pear
│ ├── assembled
│ ├── discarded
│ └── unassembled
├── pipeline_info
│ ├── co2_emissions
│ │ ├── co2footprint_report.html
│ │ ├── co2footprint_summary.html
│ │ └── co2footprint_trace.txt
│ ├── execution_report.html
│ ├── execution_timeline.html
│ ├── execution_trace.txt
│ ├── pipeline_dag.html
│ └── software_versions.yml
├── r-processing
│ └── sample
│ ├── classified.tsv
│ ├── pie_charts
│ │ ├── family.pdf
│ │ ├── genus.pdf
│ │ └── order.pdf
│ └── summary.tsv
├── sratools_fasterq-dump
│ └── sample
├── usearch
│   └── sintax_summary
│       └── sample
│           ├── class_summary.txt
│           ├── domain_summary.txt
│           ├── family_summary.txt
│           ├── genus_summary.txt
│           ├── kingdom_summary.txt
│           ├── order_summary.txt
│           ├── phylum_summary.txt
│           └── species_summary.txt
└── vsearch
├── derep
│ ├── clusterings
│ ├── fastas
│ └── logs
├── fastq_filter
│ ├── fastas
│ └── logs
└── sintax
cut_tsvs - directory containing TSVs of the first 2 columns of the sintax data.
cutadapt
  fastqs - directory containing adapter-trimmed fastq files for each sample.
  logs - directory containing cutadapt trimming statistics for each sample.
pear
  assembled - directory containing fastqs of successfully merged reads for each sample.
  discarded - directory containing fastqs of reads discarded due to quality for each sample.
  unassembled - directory containing fastqs of reads that could not be merged for each sample.
pipeline_info - directory containing pipeline statistics, including CO2 emissions.
r-processing
  classified.tsv - TSV containing taxonomy prediction information.
  pie_charts - PDFs of the top predicted species at different taxonomic levels.
  summary.tsv - TSV containing summary statistics.
sratools_fasterq-dump - fastqs obtained from an SRA ID.
usearch - text files containing the name, number of reads, percentage of reads, and cumulative percentage of reads for each taxonomic level.
vsearch
  derep
    clusterings - directory containing dereplicated clusterings for each sample.
    fastas - directory containing dereplicated fastas for each sample.
    logs - directory containing vsearch dereplication statistics for each sample.
  fastq_filter
    fastas - directory containing filtered fastas for each sample.
    logs - directory containing vsearch fastq_filter statistics for each sample.
  sintax - directory containing vsearch sintax taxonomy prediction output files.
The basic configuration of processes using labels can be found in conf/base.config.
Module-specific configuration using process names can be found in conf/modules.config.
Please note: The nf-core CUTADAPT module is labelled as process_medium in the module main.nf. However, for pollen metabarcoding data the fastqs are significantly smaller, so this resource requirement has been overridden inside conf/modules.config to match the process_single resource requirements; a sketch of such an override is shown below.
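For illustration, a name-based override of this kind typically takes the following shape in a Nextflow config file (a minimal sketch; the exact selector and resource values used by this pipeline may differ):
process {
    withName: 'CUTADAPT' {
        // Values shown are the usual nf-core process_single defaults (an assumption).
        cpus   = 1
        memory = 6.GB
        time   = 4.h
    }
}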
This pipeline is designed to run in various modes, which can be supplied as a comma-separated list, e.g. -profile profile1,profile2.
Please select one of the following profiles when running the pipeline.
docker - This profile uses the container software Docker when running the pipeline. This container software requires root permissions, so it is used when running on cloud infrastructure or your local machine (depending on permissions). Please note: you must have Docker installed to use this profile.
singularity - This profile uses the container software Singularity when running the pipeline. This container software does not require root permissions, so it is used when running on on-premise HPCs or your local machine (depending on permissions). Please note: you must have Singularity installed to use this profile.
apptainer - This profile uses the container software Apptainer when running the pipeline. This container software does not require root permissions, so it is used when running on on-premise HPCs or your local machine (depending on permissions). Please note: you must have Apptainer installed to use this profile.
aws_batch - This profile is used if you are running the pipeline on AWS utilising the AWS Batch functionality. Please note: you must use the docker profile with AWS Batch.
test - This profile is used if you want to test running the pipeline on your infrastructure. Please note: you do not provide any input parameters if this profile is selected, but you must still provide a container profile.
If you want to run this pipeline on your institute's on-premise HPC or specific cloud infrastructure then please contact us and we will help you build and test a custom config file. This config file will be published to our configs repository.
Please note: The -resume flag reuses the cached results of previously successful tasks, so the pipeline does not repeat completed work.
- Running the pipeline with Docker profiles:
nextflow run main.nf -profile docker -resume --input data/input_full-s3.csv --database "s3://pollen-metabarcoding-test-data/data/viridiplantae_all_2014.sintax.fa" --FW_primer "ATGCGATACTTGGTGTGAAT" --RV_primer "GCATATCAATAAGCGGAGGA"
(The example database was obtained from molbiodiv/meta-barcoding-dual-indexing).
The database should be a list of fasta sequences, where the header contains kingdom (k), phylum (p), class (c), order (o), family (f), genus (g) and species (s) identifiers (separated by commas). If your database does not contain all these definitions, the pipeline will fail. If you do not have this structure, you can run the script bin/sanitise_database.pl on your database, and it will infill any missing classification data with null. An illustrative header is shown below.
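For illustration, a header following the vsearch sintax annotation convention might look like this (the accession and taxonomy values are made up; check the example database above for the exact syntax the pipeline expects):
>AB0001;tax=k:Viridiplantae,p:Streptophyta,c:Magnoliopsida,o:Fagales,f:Betulaceae,g:Betula,s:Betula_pendula
ATGCGATACTTGGTGTGAATTGCAGAATCC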
- Running the pipeline with Singularity and test profiles:
nextflow run main.nf -profile singularity,test_small -resume
- Running the pipeline with additional parameters:
nextflow run main.nf -profile docker -resume \
--input data/input_small-s3.csv \
--database "s3://pollen-metabarcoding-test-data/data/viridiplantae_all_2014.sintax.fa" \
--FW_primer "ATGCGATACTTGGTGTGAAT" --RV_primer "GCATATCAATAAGCGGAGGA" \
--fastq_maxee 0.5 --fastq_minlen 250 --fastq_maxns 0 --fasta_width 0 \
--derep_strand "plus" \
--sintax_strand "both" --sintax_cutoff 0.95
- Running the pipeline with a custom config file:
nextflow run main.nf -profile docker,aws_batch -resume --input data/input_manual.csv --database "s3://pollen-metabarcoding-test-data/data/viridiplantae_all_2014.sintax.fa" --FW_primer "ATGCGATACTTGGTGTGAAT" --RV_primer "GCATATCAATAAGCGGAGGA" --custom_config /path/to/custom/config
- Running on Gitpod: If you wish to run on Gitpod, please be aware that the usearch (sintax_summary) module will fail, as Gitpod doesn't allow 32-bit programs.
nextflow run main.nf -profile docker,test_small -resume --gitpod
The data used to test this pipeline is available via the ENA project ID PRJEB26439.
There are two test profiles using this data:
test_small - contains 3 samples for small, fast testing.
test_full - contains 47 samples (the entire dataset) for large, real-world replication testing.
If you need any support do not hesitate to contact us at any of:
c.wyatt [at] ucl.ac.uk
ecoflow.ucl [at] gmail.com