-
Notifications
You must be signed in to change notification settings - Fork 900
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update CUT&RUN tutorial #5104
base: main
Are you sure you want to change the base?
Update CUT&RUN tutorial #5104
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,7 +2,6 @@ | |
layout: tutorial_hands_on | ||
|
||
title: CUT&RUN data analysis | ||
draft: true | ||
zenodo_link: https://zenodo.org/record/3862793 | ||
questions: | ||
- Which binding motif has the transcription factor GATA1? | ||
|
@@ -131,7 +130,7 @@ We first have to check if our data contains adapter sequences that we have to re | |
> - *"Input Collection"*: `2 PE fastqs` | ||
> | ||
> 2. {% tool [FastQC](toolshed.g2.bx.psu.edu/repos/devteam/fastqc/fastqc/0.73+galaxy0) %} with the following parameters: | ||
> - *"Short read data from your current history"*: Choose the output of **Flatten collection** {% icon tool %}selected as **Dataset collection**. | ||
> - {% icon param-collection %} *"Raw read data from your current history"*: Choose the output of **Flatten collection** {% icon tool %} selected as **Dataset collection**. | ||
> 3. Inspect the web page output of **FastQC** {% icon tool %} for the `Rep1_forward` sample. Check what adapters are found at the end of the reads. | ||
> | ||
> > <question-title></question-title> | ||
|
@@ -159,7 +158,7 @@ We first have to check if our data contains adapter sequences that we have to re | |
> > > | ||
> > > 4. **Adapter Content** | ||
> > > | ||
> > > Our data contains an adapter that we still have to remove. | ||
> > > Our data contains an adapter (Illumina Universal Adapter) that we still have to remove. | ||
> > > | ||
> > {: .solution} | ||
> > | ||
|
@@ -179,7 +178,7 @@ The FastQC report pointed out that we have in our data some standard Illumina ad | |
|
||
> <hands-on-title>Task description</hands-on-title> | ||
> | ||
> 1. {% tool [Trim Galore!]( https://toolshed.g2.bx.psu.edu/view/bgruening/trim_galore/cd7e644cae1d) %} with the following parameters: | ||
> 1. {% tool [Trim Galore!](toolshed.g2.bx.psu.edu/repos/bgruening/trim_galore/trim_galore/0.6.7+galaxy0) %} with the following parameters: | ||
> - *"Is this library paired- or single-end?"*: `Paired Collection` | ||
> - *"Select a paired collection"*: select `2 PE fastqs` | ||
> - In *"Adapter sequence to be trimmed"*: `Illumina universal` | ||
|
@@ -188,7 +187,7 @@ The FastQC report pointed out that we have in our data some standard Illumina ad | |
> - In *"Discard reads that became shorter than length N"*: `15` | ||
> - In *"Generate a report file"*: `Yes` | ||
> | ||
> 2. Click on the {% icon galaxy-eye %} (eye) icon of the report for `Rep1` and read the first lines. | ||
> 2. Click on the {% icon galaxy-eye %} (eye) icon of the report for `Rep1` and look for the info. | ||
{: .hands_on} | ||
|
||
> <question-title></question-title> | ||
|
@@ -198,8 +197,8 @@ The FastQC report pointed out that we have in our data some standard Illumina ad | |
> | ||
> > <solution-title></solution-title> | ||
> > | ||
> > 1. ~55% and ~57% | ||
> > 2. 100% | ||
> > 1. ~55% for Read 1 and ~57% for Read 2 | ||
> > 2. The last line indicates that 3.5% of pairs have been removed. | ||
> > | ||
> {: .solution} | ||
> | ||
|
@@ -216,32 +215,13 @@ The hg38 version of the human genome contains [alternate loci](https://www.ncbi. | |
are present both in the canonical chromosome and on its alternate loci. The reads that map to these regions would map twice. To be able to filter reads falling into | ||
repetitive regions but keep reads falling into regions present in alternate loci, we will map on the Canonical version of hg38 (only the chromosome with numbers, chrX, chrY, and chrM). | ||
|
||
> <comment-title>Dovetailing</comment-title> | ||
> We will allow dovetailing of read pairs with Bowtie2. This is because adapters are removed by Cutadapt only when at least 3 bases match the adapter sequence, so it is possible that after trimming a read can contain 1-2 bases of adapter and go beyond it's mate start site. This occurs especially for CUT&RUN because the read length is quite short. Bowtie thus discards reads such as: | ||
> ``` | ||
> ---------------------Mate 1---------------------------------> | ||
> <---------------------Mate 2---------------------- | ||
> ``` | ||
> | ||
> or | ||
> | ||
> ``` | ||
> ---------------------Mate 1---------------------------------> | ||
> <---------------------Mate 2--------------------------------- | ||
> ``` | ||
> | ||
> This is what we call dovetailing and we want to consider this pair as a valid concordant alignment. | ||
{: .comment} | ||
|
||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I decided to remove this part because the tutorial is using TrimGalore and removes adapters even if they are a single base pair. |
||
> <hands-on-title>Mapping reads to reference genome</hands-on-title> | ||
> | ||
> 1. {% tool [Bowtie2](toolshed.g2.bx.psu.edu/repos/devteam/bowtie2/bowtie2/2.4.5+galaxy1) %} with the following parameters: | ||
> 1. {% tool [Bowtie2](toolshed.g2.bx.psu.edu/repos/devteam/bowtie2/bowtie2/2.5.3+galaxy1) %} with the following parameters: | ||
> - *"Is this single or paired library"*: `Paired-end Dataset Collection` | ||
> - *"FASTQ Paired Dataset*: select the output of **Trim Galore!** {% icon tool %} *"paired reads"* | ||
> - *"Do you want to set paired-end options?"*: `Yes` | ||
> - *"Set the maximum fragment length for valid paired-end alignments"*: `1000` | ||
> - *"Allow mate dovetailing"*: `Yes` | ||
> - *"Will you select a reference genome from your history or use a built-in index?"*: `Use a built-in genome index` | ||
> - *"Select reference genome"*: `Human (Homo sapiens): hg38 Canonical` | ||
> - *"Set read groups information?"*: `Do not set` | ||
|
@@ -264,7 +244,7 @@ repetitive regions but keep reads falling into regions present in alternate loci | |
> | ||
> > <solution-title></solution-title> | ||
> > | ||
> > 41.46+57.51=98.97% | ||
> > 36.47+62.39=98.86% | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't know why these number changed (maybe the version of bowtie2)... |
||
> > | ||
> {: .solution} | ||
> | ||
|
@@ -280,21 +260,21 @@ so we remove these reads. We also remove reads with low mapping quality and read | |
|
||
> <hands-on-title>Filtering of uninformative reads</hands-on-title> | ||
> | ||
> 1. {% tool [Filter BAM datasets on a variety of attributes](toolshed.g2.bx.psu.edu/repos/devteam/bamtools_filter/bamFilter/2.5.1+galaxy0) %} with the following parameters: | ||
> 1. {% tool [Filter BAM datasets on a variety of attributes](toolshed.g2.bx.psu.edu/repos/devteam/bamtools_filter/bamFilter/2.5.2+galaxy2) %} with the following parameters: | ||
> - {% icon param-collection %} *"BAM dataset(s) to filter"*: Select the output of **Bowtie2** {% icon tool %} *"alignments"* | ||
> - In *"Condition"*: | ||
> - {% icon param-repeat %} *"Insert Condition"* | ||
> - {% icon param-repeat %} *"Condition"* | ||
> - In *"Filter"*: | ||
> - {% icon param-repeat %} *"Insert Filter"* | ||
> - *"Select BAM property to filter on"*: `mapQuality` | ||
> - *"Filter"* | ||
> - *"Select BAM property to filter on"*: `Mapping quality` | ||
> - *"Filter on read mapping quality (phred scale)"*: `>=30` | ||
> - {% icon param-repeat %} *"Insert Filter"* | ||
> - *"Select BAM property to filter on"*: `Proper Pair` | ||
> - *"Select properly paired reads"*: `Yes` | ||
> - {% icon param-repeat %} *"Insert Filter"* | ||
> - *"Select BAM property to filter on"*: `Reference name of the read` | ||
> - *"Filter on the reference name for the read"*: `!chrM` | ||
> - *"Would you like to set rules?"*: `No` | ||
> - *"Would you like to set rules?"*: `False` | ||
> | ||
> | ||
> 2. Click on the input and the output BAM files of the filtering step. Check the size of the files. | ||
|
@@ -308,7 +288,7 @@ so we remove these reads. We also remove reads with low mapping quality and read | |
> | ||
> > <solution-title></solution-title> | ||
> > | ||
> > 1. The original BAM files are 14-15 MB, the filtered ones are 5.6 and 6.8 MB. Approximately half of the alignments were removed. | ||
> > 1. The original BAM files are 14-15 MB, the filtered ones are 5.8 and 7.1 MB. Approximately half of the alignments were removed. | ||
> > | ||
> > 2. You should modify the mapQuality criteria and decrease the threshold. | ||
> > | ||
|
@@ -322,7 +302,7 @@ Because of the PCR amplification, there might be read duplicates (different read | |
|
||
> <hands-on-title>Remove duplicates</hands-on-title> | ||
> | ||
> 1. {% tool [MarkDuplicates](toolshed.g2.bx.psu.edu/repos/devteam/picard/picard_MarkDuplicates/2.18.2.3) %} with the following parameters: | ||
> 1. {% tool [MarkDuplicates](toolshed.g2.bx.psu.edu/repos/devteam/picard/picard_MarkDuplicates/3.1.1.0) %} with the following parameters: | ||
> - {% icon param-collection %} *"Select SAM/BAM dataset or dataset collection"*: Select the output of **Filter** {% icon tool %} *"BAM"* | ||
> - *"If true do not write duplicates to the output file instead of writing them with appropriate flags set"*: `Yes` | ||
> | ||
|
@@ -363,8 +343,8 @@ Because of the PCR amplification, there might be read duplicates (different read | |
> | ||
> > <solution-title></solution-title> | ||
> > | ||
> > 1. 81466 | ||
> > 2. 983 | ||
> > 1. 81460 | ||
> > 2. 982 | ||
> > | ||
> {: .solution} | ||
> | ||
|
@@ -378,7 +358,7 @@ too much compared to the diversity of the library you generated. Consequently, l | |
> <hands-on-title>Check Adapter Removal with FastQC</hands-on-title> | ||
> | ||
> 1. {% tool [FastQC](toolshed.g2.bx.psu.edu/repos/devteam/fastqc/fastqc/0.73+galaxy0) %} with the following parameters: | ||
> - {% icon param-collection %} *"Short read data from your current history"*: select the output of **MarkDuplicates** . | ||
> - {% icon param-collection %} *"Raw read data from your current history"*: select the output of **MarkDuplicates** BAM. | ||
> | ||
> 2. Click on the {% icon galaxy-eye %} (eye) icon of the report and read the first lines. | ||
{: .hands_on} | ||
|
@@ -445,7 +425,7 @@ We convert the BAM file to BED format because when we set the extension size in | |
|
||
> <hands-on-title>Convert the BAM to BED</hands-on-title> | ||
> | ||
> 1. {% tool [bedtools BAM to BED converter](toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_bamtobed/2.30.0+galaxy2) %} with the following parameters: | ||
> 1. {% tool [bedtools BAM to BED converter](toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_bamtobed/2.31.1+galaxy0) %} with the following parameters: | ||
> - {% icon param-collection %} *"Convert the following BAM file to BED"*: Select the output of **MarkDuplicates** {% icon tool %} BAM output | ||
> | ||
{: .hands_on} | ||
|
@@ -455,7 +435,7 @@ We call peaks with MACS2. To get the coverage centered on the 5' extended 100bp | |
|
||
> <hands-on-title>Call peaks with MACS2</hands-on-title> | ||
> | ||
> 1. {% tool [MACS2 callpeak](toolshed.g2.bx.psu.edu/repos/iuc/macs2/macs2_callpeak/2.2.7.1+galaxy0) %} with the following parameters: | ||
> 1. {% tool [MACS2 callpeak](toolshed.g2.bx.psu.edu/repos/iuc/macs2/macs2_callpeak/2.2.9.1+galaxy0) %} with the following parameters: | ||
> - *"Are you pooling Treatment Files?"*: `No` | ||
> - {% icon param-collection %} Select the output of **bedtools BAM to BED** converter {% icon tool %} | ||
> - *"Do you have a Control File?"*: `No` | ||
|
@@ -512,7 +492,7 @@ We can remove such peaks if we simply overlap the two peak files and consequentl | |
|
||
> <hands-on-title>Select robust GATA1 peaks:</hands-on-title> | ||
> | ||
> 1. {% tool [bedtools Intersect intervals find overlapping intervals in various ways](toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_intersectbed/2.30.0+galaxy1) %} with the following parameters: | ||
> 1. {% tool [bedtools Intersect intervals find overlapping intervals in various ways](toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_intersectbed/2.31.1+galaxy0) %} with the following parameters: | ||
> - {% icon param-file %} *"File A to intersect with B"*: Select `Rep1` | ||
> - *"Combined or separate output files"*: `One output file per 'input B' file` | ||
> - {% icon param-file %} *"File B to intersect with A"*: Select `Rep2` | ||
|
@@ -571,9 +551,9 @@ We can further remove some noise with a positive control, that is why we have do | |
> > | ||
> > > <solution-title></solution-title> | ||
> > > | ||
> > > 1. 2,409 peaks | ||
> > > 2. 454 peaks | ||
> > > 3. Our precision is ~ 84%. A high precision is an indicator that we can predict true binding regions with high confidence. | ||
> > > 1. 2,865 peaks (this is the number of lines of our last output) | ||
> > > 2. About 4000 peaks (Rep1 and Rep2 has about 6-7 thousand peaks each) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @heylf Here I am not sure of what you consider false positive but I guess they are peaks in a single replicate which are not overlapping the second replicate. |
||
> > > 3. Our precision is ~ 33%. A high precision is an indicator that we can predict true binding regions with high confidence. | ||
> > {: .solution} | ||
> > | ||
> {: .question} | ||
|
@@ -622,7 +602,7 @@ Let's find out the sequence motifs of the TF GATA1. Studies have revealed that G | |
> > | ||
> > > <solution-title></solution-title> | ||
> > > | ||
> > > 1. !!!! Where is this info? !!! | ||
> > > 1. Where is this info? | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could not find this info |
||
> > > 2. E-value = 5.0e-383 | ||
> > {: .solution} | ||
> > | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@heylf could you confirm it makes sense (the output of bowtie2 does not have 300k reads as input).