Home
Welcome to the PureCLIP wiki!
PureCLIP is a tool to detect protein-RNA interaction footprints from single-nucleotide CLIP-seq data, such as iCLIP and eCLIP.
Before you start, please check the
The following tutorial describes how to run a minimal example for each PureCLIP mode. For the analysis of your own data, please have a look how to:
As a first example you can download preprocessed data from ENCODE, and filter the paired-end data to keep only R2:
wget -O aligned.prepro.bam https://www.encodeproject.org/files/ENCFF280ONP/@@download/ENCFF280ONP.bam
samtools view -hb -f 130 aligned.prepro.bam -o aligned.prepro.R2.bam
samtools index aligned.prepro.R2.bam
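The value 130 passed to -f is the sum of two SAM flag bits: 2 (read mapped in a proper pair) and 128 (second read of the pair, i.e. R2). A minimal sketch of that arithmetic using only shell built-ins (the variable names are purely illustrative):

```shell
# SAM flag bits as defined in the SAM format specification
PROPER_PAIR=2       # 0x2: each segment properly aligned
SECOND_IN_PAIR=128  # 0x80: last segment in the template (R2)

# samtools view -f keeps only reads that have ALL of the given bits set
R2_FLAG=$(( PROPER_PAIR | SECOND_IN_PAIR ))
echo "$R2_FLAG"   # prints 130
```

Reads that are R2 but not part of a proper pair are therefore dropped by this filter as well.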
Additionally, we need the corresponding reference genome:
wget -O ref.hg19.fa.gz https://www.encodeproject.org/files/female.hg19/@@download/female.hg19.fasta.gz
gunzip ref.hg19.fa.gz
To run in basic mode, PureCLIP requires BAM and BAI files, the reference genome, and a specified output file:
pureclip -i aligned.prepro.R2.bam -bai aligned.prepro.R2.bam.bai -g ref.hg19.fa -iv 'chr1;chr2;chr3;' -nt 10 -o PureCLIP.crosslink_sites.bed
With -iv, the chromosomes (or transcripts) used to learn the parameters of PureCLIP's HMM can be specified.
This reduces memory consumption and runtime.
Usually, learning on a small subset of the chromosomes, e.g. Chr1-3, does not noticeably impair the results.
However, for very sparse data the set of chromosomes can be adjusted.
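If the learning subset needs to be changed, the -iv argument can be assembled from a list of names instead of typed by hand. The following is only a sketch: build_iv is a hypothetical helper, not part of PureCLIP, and it assumes the semicolon-terminated format shown in the command above, with names matching those in the BAM file and reference genome.

```shell
# build_iv: hypothetical helper that joins its arguments into a
# ';'-terminated list such as 'chr1;chr2;chr3;' for use with -iv.
build_iv() {
    iv=""
    for chrom in "$@"; do
        iv="${iv}${chrom};"
    done
    printf '%s\n' "$iv"
}

IV_SMALL=$(build_iv chr1 chr2 chr3)
IV_LARGE=$(build_iv chr1 chr2 chr3 chr4 chr5 chr6)
echo "$IV_SMALL"
echo "$IV_LARGE"
```

The result can then be passed as, for example, -iv "$IV_LARGE".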
With -nt, the number of threads used for parallelization can be specified.
In addition to the target PUM2 eCLIP data, we download the corresponding preprocessed input data from ENCODE, and again filter the paired-end data to keep only R2:
wget -O input.aligned.prepro.bam https://www.encodeproject.org/files/ENCFF043ERY/@@download/ENCFF043ERY.bam
samtools view -hb -f 130 input.aligned.prepro.bam -o input.aligned.prepro.R2.bam
samtools index input.aligned.prepro.R2.bam
To run PureCLIP with input control data, additionally pass the (preprocessed) BAM file from the input experiment with -ibam and the associated BAI file with -ibai:
pureclip -i aligned.prepro.R2.bam -bai aligned.prepro.R2.bam.bai -g ref.hg19.fa -iv 'chr1;chr2;chr3;' -nt 10 -o PureCLIP.crosslink_sites.cov_inputSignal.bed -g1g2k -ibam input.aligned.prepro.R2.bam -ibai input.aligned.prepro.R2.bam.bai
The parameter -g1g2k constrains the shape parameter of the second gamma distribution to be smaller than or equal to the shape parameter of the first gamma distribution.
To incorporate CL-motifs into PureCLIP's model, we first need to compute position-wise CL-motif scores, which indicate each position's CL-affinity. You can use the provided precompiled list of common_CL-motifs:
wget -O motifs.txt https://github.com/skrakau/PureCLIP_data/raw/master/common_CL-motifs/dreme.w10.k4.txt
wget -O motifs.xml https://github.com/skrakau/PureCLIP_data/raw/master/common_CL-motifs/dreme.w10.k4.xml
If you want to compute the CL-motifs specific to the used eCLIP experiment, please have a look here. Given a set of CL-motifs, we use FIMO (Grant et al., 2011) to compute scored motif occurrences within the reference sequence. The following script (distributed with PureCLIP) first retrieves the reference regions covered by the target experiment, then runs FIMO to compute position-wise CL-motif match scores within those regions, and finally chooses for each position the motif with the highest score:
export BEDTOOLS=/path/to/bedtools # if not specified, PATH is searched
export FIMO=/path/to/fimo # if not specified, PATH is searched
export WINEXTRACT=/path/to/winextract # built together with PureCLIP
compute_CLmotif_scores.sh ref.hg19.fa aligned.prepro.R2.bam motifs.xml motifs.txt fimo_clmotif_occurences.bed
The resulting CL-motif scores are written to fimo_clmotif_occurences.bed.
The computed scores are then passed to PureCLIP with -fis, together with the parameter -nim 4, indicating that scores with associated motif IDs 1-4 will be used (default: only scores with motif ID 1 are used).
pureclip -i aligned.prepro.R2.bam -bai aligned.prepro.R2.bam.bai -g ref.hg19.fa -o PureCLIP.crosslink_sites.cov_CLmotifs.bed -nt 10 -iv 'chr1;chr2;chr3;' -nim 4 -fis fimo_clmotif_occurences.bed
The main output of PureCLIP is a BED6 file containing individual crosslink sites associated with a score:
- chr: Name of the chromosome or scaffold.
- start: Position of crosslink site.
- end: Position behind crosslink site (start+1).
- state: '3'
- score: log posterior probability ratio of the most likely and the second most likely state.
- strand: + or -
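Because the score sits in column 5 of a plain tab-separated file, standard tools are enough to rank or threshold the sites. The sketch below uses invented example rows and a hypothetical cutoff (MIN_SCORE); neither the values nor the file names come from a real PureCLIP run.

```shell
# Invented rows in the BED6 layout described above
# (tab-separated: chr, start, end, state, score, strand).
printf 'chr1\t1000\t1001\t3\t12.5\t+\n'  > example.crosslink_sites.bed
printf 'chr1\t1004\t1005\t3\t2.1\t+\n'  >> example.crosslink_sites.bed
printf 'chr2\t500\t501\t3\t30.0\t-\n'   >> example.crosslink_sites.bed

# Keep sites whose score (column 5) is at least MIN_SCORE,
# then sort them by score in descending numeric order.
MIN_SCORE=10
TAB=$(printf '\t')
awk -F'\t' -v min="$MIN_SCORE" '$5 >= min' example.crosslink_sites.bed \
    | sort -t "$TAB" -k5,5nr > example.high_score.bed
cat example.high_score.bed
```

A suitable cutoff depends on the experiment; the value 10 here is arbitrary.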
Optionally, if an output file for binding regions is specified with -or, individual crosslink sites with a distance <= d (specified with -dm, default 8 bp) are merged and written to a separate BED6 file:
- chr: Name of the chromosome or scaffold.
- start: Start position, position of first crosslink site.
- end: End position, position behind last crosslink site.
- indiv. scores: 'score1;score2;score3;'
- score: Sum of log posterior probability ratio scores.
- strand: + or -
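To illustrate the layout of the binding-region file, the following awk sketch merges neighbouring sites from a sites file into regions. It is not PureCLIP's implementation: it assumes the input is sorted by chromosome, strand and start, and it measures distance as the difference between consecutive start positions, which may differ from how PureCLIP applies -dm. The input rows are invented.

```shell
# Invented example sites (chr, start, end, state, score, strand), tab-separated.
printf 'chr1\t1000\t1001\t3\t12.5\t+\n'  > example.sites.bed
printf 'chr1\t1004\t1005\t3\t2.5\t+\n'  >> example.sites.bed
printf 'chr1\t1020\t1021\t3\t4.0\t+\n'  >> example.sites.bed

awk -F'\t' -v OFS='\t' -v d=8 '
function emit_region() {
    # Print the region collected so far, if any
    if (n > 0) print chr, start, end, scores, sum, strand
}
{
    # Start a new region on a new chromosome or strand, or when the gap exceeds d
    if (n == 0 || $1 != chr || $6 != strand || $2 - last > d) {
        emit_region()
        chr = $1; strand = $6; start = $2; scores = ""; sum = 0; n = 0
    }
    # Extend the current region with this site
    last = $2; end = $3; scores = scores $5 ";"; sum += $5; n++
}
END { emit_region() }
' example.sites.bed > example.regions.bed
cat example.regions.bed
```

With these rows, the first two sites (4 bp apart) form one region and the third site, 16 bp further on, forms a region of its own.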
For a full list of user options, please type:
pureclip --help