From 76e4319426d1c0d3bc636c3901ed49a2a71aa72b Mon Sep 17 00:00:00 2001 From: Lucille Delisle Date: Sat, 29 Jun 2024 20:00:52 +0200 Subject: [PATCH 1/2] Update CUT&RUN tutorial --- .../tutorials/cut_and_run/tutorial.md | 69 +++++++------------ 1 file changed, 25 insertions(+), 44 deletions(-) diff --git a/topics/epigenetics/tutorials/cut_and_run/tutorial.md b/topics/epigenetics/tutorials/cut_and_run/tutorial.md index 02fe29f9bc160b..b3523b99f6a0bc 100644 --- a/topics/epigenetics/tutorials/cut_and_run/tutorial.md +++ b/topics/epigenetics/tutorials/cut_and_run/tutorial.md @@ -131,7 +131,7 @@ We first have to check if our data contains adapter sequences that we have to re > - *"Input Collection"*: `2 PE fastqs` > > 2. {% tool [FastQC](toolshed.g2.bx.psu.edu/repos/devteam/fastqc/fastqc/0.73+galaxy0) %} with the following parameters: -> - *"Short read data from your current history"*: Choose the output of **Flatten collection** {% icon tool %}selected as **Dataset collection**. +> - {% icon param-collection %} *"Raw read data from your current history"*: Choose the output of **Flatten collection** {% icon tool %} selected as **Dataset collection**. > 3. Inspect the web page output of **FastQC** {% icon tool %} for the `Rep1_forward` sample. Check what adapters are found at the end of the reads. > > > @@ -159,7 +159,7 @@ We first have to check if our data contains adapter sequences that we have to re > > > > > > 4. **Adapter Content** > > > -> > > Our data contains an adapter that we still have to remove. +> > > Our data contains an adapter (Illumina Universal Adapter) that we still have to remove. > > > > > {: .solution} > > @@ -179,7 +179,7 @@ The FastQC report pointed out that we have in our data some standard Illumina ad > Task description > -> 1. {% tool [Trim Galore!]( https://toolshed.g2.bx.psu.edu/view/bgruening/trim_galore/cd7e644cae1d) %} with the following parameters: +> 1. {% tool [Trim Galore!](toolshed.g2.bx.psu.edu/repos/bgruening/trim_galore/trim_galore/0.6.7+galaxy0) %} with the following parameters: > - *"Is this library paired- or single-end?"*: `Paired Collection` > - *"Select a paired collection"*: select `2 PE fastqs` > - In *"Adapter sequence to be trimmed"*: `Illumina universal` @@ -188,7 +188,7 @@ The FastQC report pointed out that we have in our data some standard Illumina ad > - In *"Discard reads that became shorter than length N"*: `15` > - In *"Generate a report file"*: `Yes` > -> 2. Click on the {% icon galaxy-eye %} (eye) icon of the report for `Rep1` and read the first lines. +> 2. Click on the {% icon galaxy-eye %} (eye) icon of the report for `Rep1` and look for the info. {: .hands_on} > @@ -198,8 +198,8 @@ The FastQC report pointed out that we have in our data some standard Illumina ad > > > > > -> > 1. ~55% and ~57% -> > 2. 100% +> > 1. ~55% for Read 1 and ~57% for Read 2 +> > 2. The last line indicates that 3.5% of pairs have been removed. > > > {: .solution} > @@ -216,32 +216,13 @@ The hg38 version of the human genome contains [alternate loci](https://www.ncbi. are present both in the canonical chromosome and on its alternate loci. The reads that map to these regions would map twice. To be able to filter reads falling into repetitive regions but keep reads falling into regions present in alternate loci, we will map on the Canonical version of hg38 (only the chromosome with numbers, chrX, chrY, and chrM). -> Dovetailing -> We will allow dovetailing of read pairs with Bowtie2. This is because adapters are removed by Cutadapt only when at least 3 bases match the adapter sequence, so it is possible that after trimming a read can contain 1-2 bases of adapter and go beyond it's mate start site. This occurs especially for CUT&RUN because the read length is quite short. Bowtie thus discards reads such as: -> ``` -> ---------------------Mate 1---------------------------------> -> <---------------------Mate 2---------------------- -> ``` -> -> or -> -> ``` -> ---------------------Mate 1---------------------------------> -> <---------------------Mate 2--------------------------------- -> ``` -> -> This is what we call dovetailing and we want to consider this pair as a valid concordant alignment. -{: .comment} - - > Mapping reads to reference genome > -> 1. {% tool [Bowtie2](toolshed.g2.bx.psu.edu/repos/devteam/bowtie2/bowtie2/2.4.5+galaxy1) %} with the following parameters: +> 1. {% tool [Bowtie2](toolshed.g2.bx.psu.edu/repos/devteam/bowtie2/bowtie2/2.5.3+galaxy1) %} with the following parameters: > - *"Is this single or paired library"*: `Paired-end Dataset Collection` > - *"FASTQ Paired Dataset*: select the output of **Trim Galore!** {% icon tool %} *"paired reads"* > - *"Do you want to set paired-end options?"*: `Yes` > - *"Set the maximum fragment length for valid paired-end alignments"*: `1000` -> - *"Allow mate dovetailing"*: `Yes` > - *"Will you select a reference genome from your history or use a built-in index?"*: `Use a built-in genome index` > - *"Select reference genome"*: `Human (Homo sapiens): hg38 Canonical` > - *"Set read groups information?"*: `Do not set` @@ -264,7 +245,7 @@ repetitive regions but keep reads falling into regions present in alternate loci > > > > > -> > 41.46+57.51=98.97% +> > 36.47+62.39=98.86% > > > {: .solution} > @@ -280,13 +261,13 @@ so we remove these reads. We also remove reads with low mapping quality and read > Filtering of uninformative reads > -> 1. {% tool [Filter BAM datasets on a variety of attributes](toolshed.g2.bx.psu.edu/repos/devteam/bamtools_filter/bamFilter/2.5.1+galaxy0) %} with the following parameters: +> 1. {% tool [Filter BAM datasets on a variety of attributes](toolshed.g2.bx.psu.edu/repos/devteam/bamtools_filter/bamFilter/2.5.2+galaxy2) %} with the following parameters: > - {% icon param-collection %} *"BAM dataset(s) to filter"*: Select the output of **Bowtie2** {% icon tool %} *"alignments"* > - In *"Condition"*: -> - {% icon param-repeat %} *"Insert Condition"* +> - {% icon param-repeat %} *"Condition"* > - In *"Filter"*: -> - {% icon param-repeat %} *"Insert Filter"* -> - *"Select BAM property to filter on"*: `mapQuality` +> - *"Filter"* +> - *"Select BAM property to filter on"*: `Mapping quality` > - *"Filter on read mapping quality (phred scale)"*: `>=30` > - {% icon param-repeat %} *"Insert Filter"* > - *"Select BAM property to filter on"*: `Proper Pair` @@ -294,7 +275,7 @@ so we remove these reads. We also remove reads with low mapping quality and read > - {% icon param-repeat %} *"Insert Filter"* > - *"Select BAM property to filter on"*: `Reference name of the read` > - *"Filter on the reference name for the read"*: `!chrM` -> - *"Would you like to set rules?"*: `No` +> - *"Would you like to set rules?"*: `False` > > > 2. Click on the input and the output BAM files of the filtering step. Check the size of the files. @@ -308,7 +289,7 @@ so we remove these reads. We also remove reads with low mapping quality and read > > > > > -> > 1. The original BAM files are 14-15 MB, the filtered ones are 5.6 and 6.8 MB. Approximately half of the alignments were removed. +> > 1. The original BAM files are 14-15 MB, the filtered ones are 5.8 and 7.1 MB. Approximately half of the alignments were removed. > > > > 2. You should modify the mapQuality criteria and decrease the threshold. > > @@ -322,7 +303,7 @@ Because of the PCR amplification, there might be read duplicates (different read > Remove duplicates > -> 1. {% tool [MarkDuplicates](toolshed.g2.bx.psu.edu/repos/devteam/picard/picard_MarkDuplicates/2.18.2.3) %} with the following parameters: +> 1. {% tool [MarkDuplicates](toolshed.g2.bx.psu.edu/repos/devteam/picard/picard_MarkDuplicates/3.1.1.0) %} with the following parameters: > - {% icon param-collection %} *"Select SAM/BAM dataset or dataset collection"*: Select the output of **Filter** {% icon tool %} *"BAM"* > - *"If true do not write duplicates to the output file instead of writing them with appropriate flags set"*: `Yes` > @@ -363,8 +344,8 @@ Because of the PCR amplification, there might be read duplicates (different read > > > > > -> > 1. 81466 -> > 2. 983 +> > 1. 81460 +> > 2. 982 > > > {: .solution} > @@ -378,7 +359,7 @@ too much compared to the diversity of the library you generated. Consequently, l > Check Adapter Removal with FastQC > > 1. {% tool [FastQC](toolshed.g2.bx.psu.edu/repos/devteam/fastqc/fastqc/0.73+galaxy0) %} with the following parameters: -> - {% icon param-collection %} *"Short read data from your current history"*: select the output of **MarkDuplicates** . +> - {% icon param-collection %} *"Raw read data from your current history"*: select the output of **MarkDuplicates** BAM. > > 2. Click on the {% icon galaxy-eye %} (eye) icon of the report and read the first lines. {: .hands_on} @@ -445,7 +426,7 @@ We convert the BAM file to BED format because when we set the extension size in > Convert the BAM to BED > -> 1. {% tool [bedtools BAM to BED converter](toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_bamtobed/2.30.0+galaxy2) %} with the following parameters: +> 1. {% tool [bedtools BAM to BED converter](toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_bamtobed/2.31.1+galaxy0) %} with the following parameters: > - {% icon param-collection %} *"Convert the following BAM file to BED"*: Select the output of **MarkDuplicates** {% icon tool %} BAM output > {: .hands_on} @@ -455,7 +436,7 @@ We call peaks with MACS2. To get the coverage centered on the 5' extended 100bp > Call peaks with MACS2 > -> 1. {% tool [MACS2 callpeak](toolshed.g2.bx.psu.edu/repos/iuc/macs2/macs2_callpeak/2.2.7.1+galaxy0) %} with the following parameters: +> 1. {% tool [MACS2 callpeak](toolshed.g2.bx.psu.edu/repos/iuc/macs2/macs2_callpeak/2.2.9.1+galaxy0) %} with the following parameters: > - *"Are you pooling Treatment Files?"*: `No` > - {% icon param-collection %} Select the output of **bedtools BAM to BED** converter {% icon tool %} > - *"Do you have a Control File?"*: `No` @@ -512,7 +493,7 @@ We can remove such peaks if we simply overlap the two peak files and consequentl > Select robust GATA1 peaks: > -> 1. {% tool [bedtools Intersect intervals find overlapping intervals in various ways](toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_intersectbed/2.30.0+galaxy1) %} with the following parameters: +> 1. {% tool [bedtools Intersect intervals find overlapping intervals in various ways](toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_intersectbed/2.31.1+galaxy0) %} with the following parameters: > - {% icon param-file %} *"File A to intersect with B"*: Select `Rep1` > - *"Combined or separate output files"*: `One output file per 'input B' file` > - {% icon param-file %} *"File B to intersect with A"*: Select `Rep2` @@ -571,9 +552,9 @@ We can further remove some noise with a positive control, that is why we have do > > > > > > > > -> > > 1. 2,409 peaks -> > > 2. 454 peaks -> > > 3. Our precision is ~ 84%. A high precision is an indicator that we can predict true binding regions with high confidence. +> > > 1. 2,865 peaks (this is the number of lines of our last output) +> > > 2. About 4000 peaks (Rep1 and Rep2 has about 6-7 thousand peaks each) +> > > 3. Our precision is ~ 33%. A high precision is an indicator that we can predict true binding regions with high confidence. > > {: .solution} > > > {: .question} @@ -622,7 +603,7 @@ Let's find out the sequence motifs of the TF GATA1. Studies have revealed that G > > > > > > > > -> > > 1. !!!! Where is this info? !!! +> > > 1. Where is this info? > > > 2. E-value = 5.0e-383 > > {: .solution} > > From 1b96752739e344e24879a2e27e316b668fc603a5 Mon Sep 17 00:00:00 2001 From: Lucille Delisle Date: Sat, 29 Jun 2024 20:11:55 +0200 Subject: [PATCH 2/2] remove the draft mode --- topics/epigenetics/tutorials/cut_and_run/tutorial.md | 1 - 1 file changed, 1 deletion(-) diff --git a/topics/epigenetics/tutorials/cut_and_run/tutorial.md b/topics/epigenetics/tutorials/cut_and_run/tutorial.md index b3523b99f6a0bc..b7e51fcd62b3b9 100644 --- a/topics/epigenetics/tutorials/cut_and_run/tutorial.md +++ b/topics/epigenetics/tutorials/cut_and_run/tutorial.md @@ -2,7 +2,6 @@ layout: tutorial_hands_on title: CUT&RUN data analysis -draft: true zenodo_link: https://zenodo.org/record/3862793 questions: - Which binding motif has the transcription factor GATA1?