Skip to content

Latest commit

 

History

History
157 lines (120 loc) · 16.5 KB

pipeline_details.md

File metadata and controls

157 lines (120 loc) · 16.5 KB

RatesTools Pipeline Details

Michael G. Campana & Ellie E. Armstrong, 2019-2024
Smithsonian's National Zoo and Conservation Biology Institute
Stanford University

The following document provides detailed descriptions of the steps included in the RatesTools pipeline. A directed acyclic graph of the RatesTools pipeline is available here. Please note that the commands shown below only list non-standard options for clarity. We omit basic input/output required for these commands. See the documentation for these programs for operation details.

Table of Contents

  1. prepareRef
  2. alignSeqs
  3. sortBAM
  4. markDuplicates
  5. fixReadGroups
  6. realignIndels
  7. filterBAMs
  8. fixMate
  9. callVariants
  10. genotypegVCFs
  11. genMapIndex
  12. genmapmap
  13. repeatMask
  14. repeatModeler
  15. repeatMaskRM
  16. maskIndels
  17. simplifyBed
  18. filterChr
  19. splitTrios
  20. pullDPGQ
  21. plotDPGQ
  22. splitVCFs
  23. vcftoolsFilterSites
  24. gatkfiltersites
  25. filterRegions
  26. calcDNMRate
  27. summarizeDNM
  28. sanityCheckLogs
  29. sanityCheckLogsVcftools
  30. sanityCheckGatk
  31. sanityCheckLogsRegions
  32. generateSummaryStats
  33. References

prepareRef

The prepareRef process indexes the reference sequence using bwa index [1] (and optionally the specified indexing algorithm) and samtools faidx [2]. It also generates a sequence dictionary using samtools dict.

alignSeqs

The alignSeqs process aligns read pairs (in two separate fastq format files) against the indexed reference sequencing using bwa mem and converts to bam format using samtools view to conserve disk space.

sortBAM

The sortBAM process sorts the bam files from alignSeqs using samtools sort. This process was separated from the alignSeqs step to minimize redundant calculations should there be an error necessitating pipeline resumption.

markDuplicates

The markDuplicates process marks PCR duplicates using either sambamba markdup [3] or Picard [4] MarkDuplicates (java -jar picard.jar MarkDuplicates MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000).

fixReadGroups

The fixReadGroups process adds sample read group information to the bam files using Picard AddOrReplaceReadGroups (java -jar picard.jar AddOrReplaceReadGroups).

realignIndels

The realignIndels process first builds a bam index (.bai) for the fixReadGroups output bam using Picard BuildBamIndex (java -jar picard.jar BuildBamIndex). It then identifies intervals for indel realignment using Genome Analysis Toolkit (GATK) [5]. For GATK 3, this process uses RealignerTargetCreator (java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator) and realigns indels using GATK IndelRealigner (java -jar GenomeAnalysisTk.jar -T IndelRealigner --filter_bases_not_stored). For GATK 4, this process uses LeftAlignIndels (java -jar GenomeAnalysisTk.jar LeftAlignIndels).

filterBAMs

The filterBAM process strictly filters aligned reads using GATK PrintReads following [6] for GATK 3 (java -jar GenomeAnalysisTK.jar -T PrintReads --read_filter BadCigar --read_filter DuplicateRead --read_filter FailsVendorQualityCheck --read_filter HCMappingQuality --read_filter MappingQualityUnavailable --read_filter NotPrimaryAlignment --read_filter UnmappedRead --filter_bases_not_stored --filter_mismatching_base_and_quals). These settings are adjusted for GATK 4 compatibility (java -jar GenomeAnalysisTK.jar PrintReads --read-filter GoodCigarReadFilter --read-filter NotDuplicateReadFilter --read-filter PassesVendorQualityCheckReadFilter --read-filter MappingQualityReadFilter --read-filter MappingQualityAvailableReadFilter --read-filter PrimaryLineReadFilter --read-filter MappedReadFilter --read-filter NotOpticalDuplicateReadFilter --read-filter ProperlyPairedReadFilter).

fixMate

The fixMate process corrects sequence mate pair tags using Picard FixMateInformation (java -jar picard.jar FixMateInformation ADD_MATE_CIGAR=true). It then builds a bam index (.bai) for the output bam file using Picard BuildBamIndex (java -jar picard.jar BuildBamIndex).

callVariants

The callVariants process calls individual sequence variants (gVCF format) using GATK HaplotypeCaller (GATK 3: java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -A DepthPerSampleHC -A Coverage -A HaplotypeScore -A StrandAlleleCountsBySample -ERC GVCF -out_mode EMIT_ALL_SITES; GATK 4: java -jar GenomeAnalysisTK.jar HaplotypeCaller -ERC GVCF -G StandardAnnotation -G AS_StandardAnnotation).

genotypegVCFs

The genotypegVCFs process performs joint-genotyping and generates an all-sites VCF for all samples. It then compresses the resulting multi-sample VCF using gzip. For GATK 3, this process uses GenotypeGVCFs (java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs --includeNonVariantSites). For GATK 4, the process first combines the individual gVCFs using CombineGVCFs (java -jar GenomeAnalysisTK.jar CombineGVCFs --convert-to-base-pair-resolution) and then performs joint genotyping using GenotypeGVCFs (java -jar GenomeAnalysisTK.jar GenotypeGVCFs --include-non-variant-sites).

genMapIndex

The genMapIndex process generates the GenMap [7] index for the reference sequence for downstream mappability calculations (genmap index).

genMapMap

The genMapMap process calculates the mappability for the reference sequence using GenMap (genmap map -K 30 -E 2 -b). It then filters the GenMap results using filterGM.rb to exclude any sequences with mappability < 1.0 (filterGM.rb <raw_genmap.bed> 1.0 exclude > <genmap.1.0.bed>).

repeatMask

The repeatMask process uses RepeatMasker [8] to soft-mask repeat regions in the reference sequence based on a pre-defined target species (RepeatMasker -gccalc -nolow -xsmall -species <specified species>). The soft-masked reference sequence is passed to the repeatModeler process.

repeatModeler

The repeatModeler process uses RepeatModeler [9] and the RepeatMasker soft-masked reference sequence to generate a species-specific repeat library for the reference sequence. First, a database is built using BuildDatabase, then the library is built using RepeatModeler. The resulting library (consensi.fa.classified) RepeatMasker -pa ${rm_pa} -gccalc -nolow -lib consensi.fa.classified is passed to the repeatMaskRM process.

repeatMaskRM

The repeatMaskRM process uses RepeatMasker and the RepeatModeler custom repeat library to soft mask the reference sequence (RepeatMasker -gccalc -nolow -xsmall -lib consensi.fa.classified). The RepeatMasker out file (.out) is then converted to bed using RM2bed.rb.

maskIndels

Using indels2bed.rb, the maskIndels process scans the all-sites VCF generated by the genotypeGVCFs process for indels and generates a bed file excluding all sites a set number of bp upstream/downstream of the identified indels (indels2bed.rb <all-sites.vcf> <indel pad in bp> > <indels.bed>).

simplifyBed

The simplifyBed process identifies and merges overlapping bed entries in the bed results from the repeatMaskRM, genMapMap and maskIndels processes using BEDtools [11] (bedtools merge -i <regions.bed>).

filterChr

If a list of target chromosomes was provided, this process filters using VCFtools [10] the output VCFs to include only the specified target regions using the --chr option. The resulting VCFs are compressed using gzip.

phaseTrio

This process optionally phases the all-sites VCF using WhatsHap trio phasing [11,12] (whatshap phase --ped <ped.file> --reference=<reference sequence -o <output-phased.vcf> <all-sites.vcf> <final-bams>). The resulting vcf is compressed using gzip.

splitTrios

Using VCFtools, the splitTrios process generates all-sites VCFs for each offspring and its parents (vcftools --recode --indv <sire> --indv <dam> --indv <offspring>). The number of resulting VCFs will thus equal the number of offspring in the dataset. The resulting VCFs are compressed using gzip.

pullDPGQ

Using BCFtools [13], the pullGQDP process extracts the DP and GQ information from the chromosome-filtered VCFs (bcftools view -v snps <chromosome_vcf> | bcftools query -f "%CHROM %POS [ %DP] [ %GQ]\n").

plotDPGQ

The plotDPGQ process plots the DP and GQ information from pullDPGQ using plotDPGQ.R.

splitVCFs

The splitVCFs process splits each VCF generated during the filterChr process by chromosome/contig name for parallelization of downstream processes using nextflow_split.rb. The resulting VCFs are compressed using bgzip.

vcftoolsFilterSites

The filterSites process filters the split VCF files from the splitVCFs process using VCFtools and the site filters provided in the config file (vcftools --recode <site_filters>). Set the vcftools_site_filters parameter to "NULL" to turn off this filter. The resulting VCFs are compressed using bgzip [14].

gatkFilterSites

The filterSites process filters the VCF files from the splitVCFs/vcftoolsFilterSites process using GATK and the site filters provided in the config file (GATK3: java -jar GenomeAnalysisTK.jar -T VariantFiltration and java -jar GenomeAnalysisTK.jar -T SelectVariants --excludeFiltered; GATK 4: java -jar GenomeAnalysisTK.jar VariantFiltration and java -jar GenomeAnalysisTK.jar SelectVariants --exclude-filtered). Set the gatk_site_filters parameter to 'NULL' to turn off this filter.

filterRegions

The filterRegions process removes low-reliability regions (repeat regions, indel-affected sites, and regions of non-unique mappability) from the site-filtered VCFs from the filterSites process. The low-reliability regions are specified in the output merged bed from the simplifyBed process. On the first pass, the process uses BEDTools [13] intersect on the site-filtered VCF (bedtools intersect -a <site-filtered_vcf> -b <out_cat.bed>). Should this fail (e.g. due to a compression issue), the process repeats using zcat to uncompress the VCF and pass it via stdin to BEDTools. If neither BEDTools run is successful, the process attempts to removed the filtered regions using BCFtools view (bcftools view -R <out_cat.bed> -Ob -o tmp.bcf), BCFtools isec (bcftools isec -C <site-filtered_vcf> <tmp.bcf>) and BCFtools view (bcftools view -T <isec_out> <site-filtered_vcf>) [11]. If the BCFtools attempt fails, the process defaults to using VCFtools (vcftools --recode --exclude-bed <out_cat.bed>). The resulting VCFs are compressed using gzip.

calcDNMRate

Using the region-filtered VCFs output from the filterRegions process, the calcDNMRate process calculates the per-chromosome mutation rate using calc_denovo_mutation_rate.rb and the options specified in the config file (calc_denovo_mutation_rate.rb -i <chr.vcf> -s <sire> -d <dam> <denovo_mutation_options> > chr.log)).

summarizeDNM

Using summarize_denovo.rb, the summarizeDNM process combines the per-chromosome mutation rate results from the calcDNMRate process to obtain the genomic mutation rate. It outputs both a log summarizing the candidate DNMs and mutation rate estimates and a VCF of the candidate DNMs.

sanityCheckLogs

Retained site counts are sanity-checked and logged after each filtration step using logstats.sh.

sanityCheckLogsVcftools

Variant of sanityCheckLogs that removes too short contigs after VCFtools site filtration.

sanityCheckLogsGatk

Variant of sanityCheckLogs that removes too short contigs after GATK site filtration.

sanityCheckLogsRegions

Variant of sanityCheckLogs that removes too short contigs after region filtration.

generateSummaryStats

Using dnm_summary_stats.rb, the generateSummaryStats process calculates the numbers of sites retained after each filtration stage in the RatesTools pipeline. It also calculate the number of DNMs of each mutation class. It can then optionally remove clumps of candidate mutations (within a specified window) which are more likely to represent misalignments or complex mutations than multiple single-nucleotide substitutions. If multiple siblings are sequenced, it identifies and removes overlapping candidate mutations. The script then recalculates mutation rate estimates and confidence intervals after removing the clumped and overlapping sites.

References

  1. Li, H. (2013) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv, 1303.3997v2.
  2. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., 1000 Genome Project Data Processing Subgroup (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25, 2078-2079. DOI: 10.1093/bioinformatics/btp352.
  3. Tarasov, A., Vilella, A.J., Cuppen, E., Nijman, I.J., Prins, P. (2015) Sambamba: fast processing of NGS alignment formats. Bioinformatics, 31, 2032–2034. DOI: 10.1093/bioinformatics/btv098.
  4. Broad Institute (2020). Picard v. 2.23.8 (https://broadinstitute.github.io/picard/).
  5. McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M., DePristo, M.A. (2010) The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res, 20, 1297-1303. DOI: 10.1101/gr.107524.110.
  6. Besenbacher, S., Hvilsom, C., Marques-Bonet, T., Mailund, T., Schierup, M.H. (2019) Direct estimation of mutations in great apes reconciles phylogenetic dating. Nat Ecol Evol, 3, 286-292. DOI: 10.1038/s41559-018-0778-x.
  7. Pockrandt, C., Alzamel, M., Iliopoulos, C.S., Reinert, K. (2020) GenMap: ultra-fast computation of genome mappability. Bioinformatics, 36, 3687–3692, DOI: 10.1093/bioinformatics/btaa222.
  8. Smit, A.F.A., Hubley, R., Green, P. (2013-2015) RepeatMasker Open-4.0. (http://www.repeatmasker.org).
  9. Flynn, J.M., Hubley, R., Goubert, C., Rosen, J. Clark,. A.G., Feschotte, C., Smit, A.F. (2020) RepeatModeler2 for automated genomic discovery of transposable element families. Proc Natl Acad Sci U S A, 117, 9451-9457. DOI: 10.1073/pnas.1921046117.
  10. Danecek, P., Auton, A., Abecasis, G., Albers, C.A., Banks, E., DePristo, M.A., Handsaker, R.E., Lunter, G., Marth, G.T., Sherry, S.T., McVean, G., Durbin, R. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156–2158. DOI: 10.1093/bioinformatics/btr330.
  11. Martin, M., Patterson, M., Garg, S., Fischer, S.O., Pisanti, N., Klau, G.W., Schoenhuth, A., Marschall, T. (2016) WhatsHap: fast and accurate read-based phasing. BioRxiv, DOI: 10.1101/085050.
  12. Garg, S., Martin, M., Marschall, T. (2016) Read-based phasing of related individuals. Bioinformatics, 32, i234-i242, DOI: 10.1093/bioinformatics/btw276.
  13. Quinlan, A.R., Hall, I.M. (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26, 841-842, DOI: 10.1093/bioinformatics/btq0333.
  14. Danecek, P., Bonfield, J.K., Liddle, J., Marshall, J., Ohan, V., Pollard, M.O., Whitwham, A., Keane, T., McCarthy, S.A., Davies, R.M., Li, H. (2021) Twelve years of SAMtools and BCFtools. GigaScience, 10, giab008. DOI: 10.1093/gigascience/giab008.