Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update CUT&RUN tutorial #5104

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
70 changes: 25 additions & 45 deletions topics/epigenetics/tutorials/cut_and_run/tutorial.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,6 @@
layout: tutorial_hands_on

title: CUT&RUN data analysis
draft: true
zenodo_link: https://zenodo.org/record/3862793
questions:
- Which binding motif has the transcription factor GATA1?
Expand Down Expand Up @@ -131,7 +130,7 @@ We first have to check if our data contains adapter sequences that we have to re
> - *"Input Collection"*: `2 PE fastqs`
>
> 2. {% tool [FastQC](toolshed.g2.bx.psu.edu/repos/devteam/fastqc/fastqc/0.73+galaxy0) %} with the following parameters:
> - *"Short read data from your current history"*: Choose the output of **Flatten collection** {% icon tool %}selected as **Dataset collection**.
> - {% icon param-collection %} *"Raw read data from your current history"*: Choose the output of **Flatten collection** {% icon tool %} selected as **Dataset collection**.
> 3. Inspect the web page output of **FastQC** {% icon tool %} for the `Rep1_forward` sample. Check what adapters are found at the end of the reads.
>
> > <question-title></question-title>
Expand Down Expand Up @@ -159,7 +158,7 @@ We first have to check if our data contains adapter sequences that we have to re
> > >
> > > 4. **Adapter Content**
> > >
> > > Our data contains an adapter that we still have to remove.
> > > Our data contains an adapter (Illumina Universal Adapter) that we still have to remove.
> > >
> > {: .solution}
> >
Expand All @@ -179,7 +178,7 @@ The FastQC report pointed out that we have in our data some standard Illumina ad

> <hands-on-title>Task description</hands-on-title>
>
> 1. {% tool [Trim Galore!]( https://toolshed.g2.bx.psu.edu/view/bgruening/trim_galore/cd7e644cae1d) %} with the following parameters:
> 1. {% tool [Trim Galore!](toolshed.g2.bx.psu.edu/repos/bgruening/trim_galore/trim_galore/0.6.7+galaxy0) %} with the following parameters:
> - *"Is this library paired- or single-end?"*: `Paired Collection`
> - *"Select a paired collection"*: select `2 PE fastqs`
> - In *"Adapter sequence to be trimmed"*: `Illumina universal`
Expand All @@ -188,7 +187,7 @@ The FastQC report pointed out that we have in our data some standard Illumina ad
> - In *"Discard reads that became shorter than length N"*: `15`
> - In *"Generate a report file"*: `Yes`
>
> 2. Click on the {% icon galaxy-eye %} (eye) icon of the report for `Rep1` and read the first lines.
> 2. Click on the {% icon galaxy-eye %} (eye) icon of the report for `Rep1` and look for the info.
{: .hands_on}

> <question-title></question-title>
Expand All @@ -198,8 +197,8 @@ The FastQC report pointed out that we have in our data some standard Illumina ad
>
> > <solution-title></solution-title>
> >
> > 1. ~55% and ~57%
> > 2. 100%
> > 1. ~55% for Read 1 and ~57% for Read 2
> > 2. The last line indicates that 3.5% of pairs have been removed.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@heylf could you confirm it makes sense (the output of bowtie2 does not have 300k reads as input).

> >
> {: .solution}
>
Expand All @@ -216,32 +215,13 @@ The hg38 version of the human genome contains [alternate loci](https://www.ncbi.
are present both in the canonical chromosome and on its alternate loci. The reads that map to these regions would map twice. To be able to filter reads falling into
repetitive regions but keep reads falling into regions present in alternate loci, we will map on the Canonical version of hg38 (only the chromosome with numbers, chrX, chrY, and chrM).

> <comment-title>Dovetailing</comment-title>
> We will allow dovetailing of read pairs with Bowtie2. This is because adapters are removed by Cutadapt only when at least 3 bases match the adapter sequence, so it is possible that after trimming a read can contain 1-2 bases of adapter and go beyond it's mate start site. This occurs especially for CUT&RUN because the read length is quite short. Bowtie thus discards reads such as:
> ```
> ---------------------Mate 1--------------------------------->
> <---------------------Mate 2----------------------
> ```
>
> or
>
> ```
> ---------------------Mate 1--------------------------------->
> <---------------------Mate 2---------------------------------
> ```
>
> This is what we call dovetailing and we want to consider this pair as a valid concordant alignment.
{: .comment}


Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to remove this part because the tutorial is using TrimGalore and removes adapters even if they are a single base pair.

> <hands-on-title>Mapping reads to reference genome</hands-on-title>
>
> 1. {% tool [Bowtie2](toolshed.g2.bx.psu.edu/repos/devteam/bowtie2/bowtie2/2.4.5+galaxy1) %} with the following parameters:
> 1. {% tool [Bowtie2](toolshed.g2.bx.psu.edu/repos/devteam/bowtie2/bowtie2/2.5.3+galaxy1) %} with the following parameters:
> - *"Is this single or paired library"*: `Paired-end Dataset Collection`
> - *"FASTQ Paired Dataset*: select the output of **Trim Galore!** {% icon tool %} *"paired reads"*
> - *"Do you want to set paired-end options?"*: `Yes`
> - *"Set the maximum fragment length for valid paired-end alignments"*: `1000`
> - *"Allow mate dovetailing"*: `Yes`
> - *"Will you select a reference genome from your history or use a built-in index?"*: `Use a built-in genome index`
> - *"Select reference genome"*: `Human (Homo sapiens): hg38 Canonical`
> - *"Set read groups information?"*: `Do not set`
Expand All @@ -264,7 +244,7 @@ repetitive regions but keep reads falling into regions present in alternate loci
>
> > <solution-title></solution-title>
> >
> > 41.46+57.51=98.97%
> > 36.47+62.39=98.86%
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why these number changed (maybe the version of bowtie2)...

> >
> {: .solution}
>
Expand All @@ -280,21 +260,21 @@ so we remove these reads. We also remove reads with low mapping quality and read

> <hands-on-title>Filtering of uninformative reads</hands-on-title>
>
> 1. {% tool [Filter BAM datasets on a variety of attributes](toolshed.g2.bx.psu.edu/repos/devteam/bamtools_filter/bamFilter/2.5.1+galaxy0) %} with the following parameters:
> 1. {% tool [Filter BAM datasets on a variety of attributes](toolshed.g2.bx.psu.edu/repos/devteam/bamtools_filter/bamFilter/2.5.2+galaxy2) %} with the following parameters:
> - {% icon param-collection %} *"BAM dataset(s) to filter"*: Select the output of **Bowtie2** {% icon tool %} *"alignments"*
> - In *"Condition"*:
> - {% icon param-repeat %} *"Insert Condition"*
> - {% icon param-repeat %} *"Condition"*
> - In *"Filter"*:
> - {% icon param-repeat %} *"Insert Filter"*
> - *"Select BAM property to filter on"*: `mapQuality`
> - *"Filter"*
> - *"Select BAM property to filter on"*: `Mapping quality`
> - *"Filter on read mapping quality (phred scale)"*: `>=30`
> - {% icon param-repeat %} *"Insert Filter"*
> - *"Select BAM property to filter on"*: `Proper Pair`
> - *"Select properly paired reads"*: `Yes`
> - {% icon param-repeat %} *"Insert Filter"*
> - *"Select BAM property to filter on"*: `Reference name of the read`
> - *"Filter on the reference name for the read"*: `!chrM`
> - *"Would you like to set rules?"*: `No`
> - *"Would you like to set rules?"*: `False`
>
>
> 2. Click on the input and the output BAM files of the filtering step. Check the size of the files.
Expand All @@ -308,7 +288,7 @@ so we remove these reads. We also remove reads with low mapping quality and read
>
> > <solution-title></solution-title>
> >
> > 1. The original BAM files are 14-15 MB, the filtered ones are 5.6 and 6.8 MB. Approximately half of the alignments were removed.
> > 1. The original BAM files are 14-15 MB, the filtered ones are 5.8 and 7.1 MB. Approximately half of the alignments were removed.
> >
> > 2. You should modify the mapQuality criteria and decrease the threshold.
> >
Expand All @@ -322,7 +302,7 @@ Because of the PCR amplification, there might be read duplicates (different read

> <hands-on-title>Remove duplicates</hands-on-title>
>
> 1. {% tool [MarkDuplicates](toolshed.g2.bx.psu.edu/repos/devteam/picard/picard_MarkDuplicates/2.18.2.3) %} with the following parameters:
> 1. {% tool [MarkDuplicates](toolshed.g2.bx.psu.edu/repos/devteam/picard/picard_MarkDuplicates/3.1.1.0) %} with the following parameters:
> - {% icon param-collection %} *"Select SAM/BAM dataset or dataset collection"*: Select the output of **Filter** {% icon tool %} *"BAM"*
> - *"If true do not write duplicates to the output file instead of writing them with appropriate flags set"*: `Yes`
>
Expand Down Expand Up @@ -363,8 +343,8 @@ Because of the PCR amplification, there might be read duplicates (different read
>
> > <solution-title></solution-title>
> >
> > 1. 81466
> > 2. 983
> > 1. 81460
> > 2. 982
> >
> {: .solution}
>
Expand All @@ -378,7 +358,7 @@ too much compared to the diversity of the library you generated. Consequently, l
> <hands-on-title>Check Adapter Removal with FastQC</hands-on-title>
>
> 1. {% tool [FastQC](toolshed.g2.bx.psu.edu/repos/devteam/fastqc/fastqc/0.73+galaxy0) %} with the following parameters:
> - {% icon param-collection %} *"Short read data from your current history"*: select the output of **MarkDuplicates** .
> - {% icon param-collection %} *"Raw read data from your current history"*: select the output of **MarkDuplicates** BAM.
>
> 2. Click on the {% icon galaxy-eye %} (eye) icon of the report and read the first lines.
{: .hands_on}
Expand Down Expand Up @@ -445,7 +425,7 @@ We convert the BAM file to BED format because when we set the extension size in

> <hands-on-title>Convert the BAM to BED</hands-on-title>
>
> 1. {% tool [bedtools BAM to BED converter](toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_bamtobed/2.30.0+galaxy2) %} with the following parameters:
> 1. {% tool [bedtools BAM to BED converter](toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_bamtobed/2.31.1+galaxy0) %} with the following parameters:
> - {% icon param-collection %} *"Convert the following BAM file to BED"*: Select the output of **MarkDuplicates** {% icon tool %} BAM output
>
{: .hands_on}
Expand All @@ -455,7 +435,7 @@ We call peaks with MACS2. To get the coverage centered on the 5' extended 100bp

> <hands-on-title>Call peaks with MACS2</hands-on-title>
>
> 1. {% tool [MACS2 callpeak](toolshed.g2.bx.psu.edu/repos/iuc/macs2/macs2_callpeak/2.2.7.1+galaxy0) %} with the following parameters:
> 1. {% tool [MACS2 callpeak](toolshed.g2.bx.psu.edu/repos/iuc/macs2/macs2_callpeak/2.2.9.1+galaxy0) %} with the following parameters:
> - *"Are you pooling Treatment Files?"*: `No`
> - {% icon param-collection %} Select the output of **bedtools BAM to BED** converter {% icon tool %}
> - *"Do you have a Control File?"*: `No`
Expand Down Expand Up @@ -512,7 +492,7 @@ We can remove such peaks if we simply overlap the two peak files and consequentl

> <hands-on-title>Select robust GATA1 peaks:</hands-on-title>
>
> 1. {% tool [bedtools Intersect intervals find overlapping intervals in various ways](toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_intersectbed/2.30.0+galaxy1) %} with the following parameters:
> 1. {% tool [bedtools Intersect intervals find overlapping intervals in various ways](toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_intersectbed/2.31.1+galaxy0) %} with the following parameters:
> - {% icon param-file %} *"File A to intersect with B"*: Select `Rep1`
> - *"Combined or separate output files"*: `One output file per 'input B' file`
> - {% icon param-file %} *"File B to intersect with A"*: Select `Rep2`
Expand Down Expand Up @@ -571,9 +551,9 @@ We can further remove some noise with a positive control, that is why we have do
> >
> > > <solution-title></solution-title>
> > >
> > > 1. 2,409 peaks
> > > 2. 454 peaks
> > > 3. Our precision is ~ 84%. A high precision is an indicator that we can predict true binding regions with high confidence.
> > > 1. 2,865 peaks (this is the number of lines of our last output)
> > > 2. About 4000 peaks (Rep1 and Rep2 has about 6-7 thousand peaks each)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@heylf Here I am not sure of what you consider false positive but I guess they are peaks in a single replicate which are not overlapping the second replicate.

> > > 3. Our precision is ~ 33%. A high precision is an indicator that we can predict true binding regions with high confidence.
> > {: .solution}
> >
> {: .question}
Expand Down Expand Up @@ -622,7 +602,7 @@ Let's find out the sequence motifs of the TF GATA1. Studies have revealed that G
> >
> > > <solution-title></solution-title>
> > >
> > > 1. !!!! Where is this info? !!!
> > > 1. Where is this info?
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could not find this info

> > > 2. E-value = 5.0e-383
> > {: .solution}
> >
Expand Down
Loading