gatk4/markduplicates - pipe uncompressed output to speed up CRAM writing #7497

Draft
wants to merge 2 commits into
base: master
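
The diff below picks the `samtools view` output flag from the file extension of the prefix: `-Ch` for CRAM, `-bh` for everything else. A minimal shell sketch of that selection follows. The helper name `output_flag_for` and the sample file names are hypothetical and not part of the PR; the module itself makes this choice in Groovy.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): mirrors the Groovy ternaries that
# choose the samtools view flag from the output file extension.
output_flag_for() {
    local prefix="$1"
    # ${prefix##*.} strips everything up to and including the last dot.
    case "${prefix##*.}" in
        cram) echo "-Ch" ;;  # write CRAM, keep the header
        *)    echo "-bh" ;;  # any other extension is written as BAM, keep the header
    esac
}

# Illustrative calls with made-up file names.
output_flag_for "sample.cram"
output_flag_for "sample.bam"
```

As in the Groovy version, a name without a `.cram` extension falls through to BAM output, so only an explicit CRAM prefix requires the FASTA reference.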
25 changes: 13 additions & 12 deletions modules/nf-core/gatk4/markduplicates/main.nf
@@ -25,13 +25,14 @@ process GATK4_MARKDUPLICATES {

script:
def args = task.ext.args ?: ''
prefix = task.ext.prefix ?: "${meta.id}.bam"

// If the extension is CRAM, then change it to BAM
prefix_bam = prefix.tokenize('.')[-1] == 'cram' ? "${prefix.substring(0, prefix.lastIndexOf('.'))}.bam" : prefix
def args2 = task.ext.args2 ?: ''
def prefix = task.ext.prefix ?: "${meta.id}.bam"
def output_format = prefix.tokenize('.')[-1] == 'cram' ? "cram" : "bam"
def output_flag = output_format == 'cram' ? "-Ch" : "-bh"

def input_list = bam.collect{"--INPUT $it"}.join(' ')
def reference = fasta ? "--REFERENCE_SEQUENCE ${fasta}" : ""
def reference2 = fasta ? "-T ${fasta}" : ""

def avail_mem = 3072
if (!task.memory) {
@@ -40,24 +41,24 @@ process GATK4_MARKDUPLICATES {
avail_mem = (task.memory.mega*0.8).intValue()
}

if (!fasta && output_format == 'cram') error "Fasta reference is required for CRAM output"

// Compressing to CRAM with samtools rather than MarkDuplicates speeds up computation:
// https://medium.com/@acarroll.dna/looking-at-trade-offs-in-compression-levels-for-genomics-tools-eec2834e8b94
"""
gatk --java-options "-Xmx${avail_mem}M -XX:-UsePerfData" \\
MarkDuplicates \\
$input_list \\
--OUTPUT ${prefix_bam} \\
--COMPRESSION_LEVEL 0 \\
--OUTPUT /dev/stdout \\
--METRICS_FILE ${prefix}.metrics \\
--TMP_DIR . \\
${reference} \\
$args
$args \\
| samtools view $args2 ${output_flag} ${reference2} -o ${prefix}

# If cram files are wished as output, the run samtools for conversion
if [[ ${prefix} == *.cram ]]; then
samtools view -Ch -T ${fasta} -o ${prefix} ${prefix_bam}
rm ${prefix_bam}
samtools index ${prefix}
fi
# Create index for BAM/CRAM
samtools index ${prefix}

cat <<-END_VERSIONS > versions.yml
"${task.process}":