Sabrina Krakau edited this page Jun 5, 2017 · 41 revisions

Welcome to the PureCLIP wiki!

PureCLIP is a tool to detect protein-RNA interaction footprints from single-nucleotide CLIP-seq data, such as iCLIP and eCLIP.

Before you start, please check the

In the following tutorial we will describe for each PureCLIP mode how to run a minimal example. For the analysis of your own data, please have a look how to:

PureCLIP in basic mode

Generate sample files for minimal example

As a first example you can download preprocessed data from ENCODE, and filter the paired-end data to keep only R2:

wget -O aligned.prepro.bam https://www.encodeproject.org/files/ENCFF280ONP/@@download/ENCFF280ONP.bam
samtools view -hb -f 130 aligned.prepro.bam -o aligned.prepro.R2.bam
samtools index aligned.prepro.R2.bam
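The value 130 passed to -f is the sum of two SAM FLAG bits: 2 (read mapped in a proper pair) and 128 (second read of the pair). A read is kept only if both bits are set. The small helper below, whose name has_all_flags is purely illustrative, reproduces that test in plain shell arithmetic so you can check individual FLAG values by hand:

```shell
# Hypothetical helper: succeed if every bit of mask $2 is set in SAM FLAG $1.
has_all_flags() {
  [ $(( $1 & $2 )) -eq "$2" ]
}

# 0x2 = proper pair, 0x80 = second in pair
echo $(( 0x2 | 0x80 ))

# FLAG 163 (paired, proper pair, mate reverse, second in pair) passes -f 130
has_all_flags 163 130 && echo "163: kept"

# FLAG 99 (paired, proper pair, mate reverse, first in pair) does not
has_all_flags 99 130 || echo "99: dropped"
```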

Additionally, we need the corresponding reference genome:

wget -O ref.hg19.fa.gz https://www.encodeproject.org/files/female.hg19/@@download/female.hg19.fasta.gz 
gunzip ref.hg19.fa.gz

PureCLIP

In basic mode, PureCLIP requires the BAM and BAI files, the reference genome and a specified output file:

pureclip -i aligned.prepro.R2.bam -bai aligned.prepro.R2.bam.bai -g ref.hg19.fa -iv 'chr1;chr2;chr3;' -nt 10 -o PureCLIP.crosslink_sites.bed

With -iv, you can specify the chromosomes (or transcripts) that are used to learn the parameters of PureCLIP's HMM. This reduces memory consumption and runtime. Usually, learning on a small subset of the chromosomes, e.g. chr1-3, does not noticeably impair the results. However, for very sparse data the subset can be adjusted. With -nt, you can specify the number of threads used for parallelization.
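If you need a different training subset, for example for sparse data, the -iv value can be assembled from a list of names instead of being typed by hand. This is a minimal sketch; the helper name make_iv is an assumption, and the only format it relies on is the semicolon-terminated list shown in the command above:

```shell
# Hypothetical helper: join sequence names into the 'name1;name2;' form used by -iv.
make_iv() {
  for name in "$@"; do
    printf '%s;' "$name"
  done
}

iv=$(make_iv chr1 chr2 chr3 chr4 chr5)
echo "$iv"
```

The resulting string can then be passed as -iv "$iv" in place of the literal value.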


PureCLIP incorporating input control data

Generate sample files for minimal example

Besides the target PUM2 eCLIP data, we download the corresponding preprocessed input data from ENCODE, and again filter the paired-end data to keep only R2:

wget -O input.aligned.prepro.bam https://www.encodeproject.org/files/ENCFF043ERY/@@download/ENCFF043ERY.bam
samtools view -hb -f 130 input.aligned.prepro.bam -o input.aligned.prepro.R2.bam
samtools index input.aligned.prepro.R2.bam

PureCLIP

To run PureCLIP with input control data, additionally hand over the (preprocessed) BAM file from the input experiment with -ibam and the associated BAI file with -ibai:

pureclip -i aligned.prepro.R2.bam -bai aligned.prepro.R2.bam.bai -g ref.hg19.fa -iv 'chr1;chr2;chr3;' -nt 10 -o PureCLIP.crosslink_sites.cov_inputSignal.bed -g1g2k -ibam input.aligned.prepro.R2.bam -ibai input.aligned.prepro.R2.bam.bai

The parameter -g1g2k constrains the shape parameter of the second gamma distribution to be less than or equal to the shape parameter of the first gamma distribution.
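Since this mode needs two BAM files and two BAI files, it can help to confirm that every index exists before starting a long run. The sketch below assumes the index sits next to its BAM file with the suffix .bai, as produced by the samtools index calls above; the helper name missing_bai is illustrative:

```shell
# Hypothetical helper: print every BAM argument that has no "<bam>.bai" next to it.
missing_bai() {
  for bam in "$@"; do
    [ -f "$bam.bai" ] || echo "$bam"
  done
}

# Lists the files that still need "samtools index"; prints nothing if all are indexed.
missing_bai aligned.prepro.R2.bam input.aligned.prepro.R2.bam
```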


PureCLIP incorporating CL-motif scores

Generate sample files for minimal example

To incorporate CL-motifs into the model of PureCLIP, we first need to compute position-wise CL-motif scores, which indicate each position's CL-affinity. You can use the provided precompiled list of common_CL-motifs:

wget -O motifs.txt https://raw.githubusercontent.com/skrakau/PureCLIP_data/master/common_CL-motifs/dreme.w10.k4.txt
wget -O motifs.xml https://raw.githubusercontent.com/skrakau/PureCLIP_data/master/common_CL-motifs/dreme.w10.k4.xml

If you want to compute the CL-motifs specific to the used eCLIP experiment, please have a look here. Given a set of CL-motifs, we use FIMO (Grant et al., 2011) to compute motif occurrences, each associated with a score, within the reference sequence. The following script (distributed with PureCLIP) first retrieves the reference regions covered by the target experiment, then runs FIMO to compute position-wise CL-motif match scores within these regions, and finally chooses for each position the motif with the highest score:

export BEDTOOLS=/path/to/bedtools        # if not specified, PATH is searched
export FIMO=/path/to/fimo                # if not specified, PATH is searched
export WINEXTRACT=/path/to/winextract    # built together with PureCLIP
compute_CLmotif_scores.sh ref.hg19.fa aligned.prepro.R2.bam motifs.xml motifs.txt fimo_clmotif_occurences.bed

The resulting CL-motif scores are written to fimo_clmotif_occurences.bed.
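Because the script depends on external tools, it can be useful to check beforehand which executable would be picked up. The following sketch mirrors the rule stated in the comments above (use the exported path if set, otherwise search PATH). It is not taken from compute_CLmotif_scores.sh itself, and the helper name resolve_tool is an assumption:

```shell
# Hypothetical helper: print the explicit path $1 if non-empty,
# otherwise look up program $2 on PATH (fails if it is not found).
resolve_tool() {
  if [ -n "$1" ]; then
    echo "$1"
  else
    command -v "$2"
  fi
}

resolve_tool "${BEDTOOLS:-}" bedtools || echo "bedtools not found" >&2
resolve_tool "${FIMO:-}" fimo || echo "fimo not found" >&2
```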

PureCLIP

The computed scores are then handed over to PureCLIP together with the parameter -nim 4, indicating that scores with associated motif IDs 1-4 will be used (default: only scores with motif ID 1 are used).

pureclip -i aligned.prepro.R2.bam -bai aligned.prepro.R2.bam.bai -g ref.hg19.fa -o PureCLIP.crosslink_sites.cov_CLmotifs.bed -nt 10 -iv 'chr1;chr2;chr3;' -nim 4 -fis fimo_clmotif_occurences.bed

PureCLIP's output

The main output of PureCLIP is a BED6 file containing individual crosslink sites associated with a score:

  1. chr: Name of the chromosome or scaffold.
  2. start: Position of crosslink site.
  3. end: Position behind crosslink site (start+1).
  4. state: '3'
  5. score: log posterior probability ratio of the most likely and the second most likely state.
  6. strand: + or -
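Since the score is stored in column 5, downstream filtering needs nothing more than awk. The sketch below keeps only sites whose score reaches a chosen minimum. The helper name filter_by_score, the threshold of 5 and the two sample lines are made up for illustration and are not recommended values:

```shell
# Hypothetical helper: keep BED6 lines from stdin whose score (column 5) is >= $1.
filter_by_score() {
  awk -v min="$1" -F '\t' '$5 + 0 >= min + 0'
}

# Two made-up crosslink sites; only the first passes a threshold of 5.
printf 'chr1\t100\t101\t3\t12.7\t+\nchr1\t250\t251\t3\t0.4\t-\n' | filter_by_score 5
```

For real data, feed the output file into the function instead, e.g. filter_by_score 5 < PureCLIP.crosslink_sites.bed.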

Optionally, if an output file for binding regions is specified with -or, individual crosslink sites with a distance <= d (specified with -dm, default 8 bp) are merged and written to a separate BED6 file:

  1. chr: Name of the chromosome or scaffold.
  2. start: Start position, position of first crosslink site.
  3. end: End position, position behind last crosslink site.
  4. indiv. scores: 'score1;score2;score3;'
  5. score: Sum of log posterior probability ratio scores.
  6. strand: + or -
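To make the region format concrete, the following awk sketch merges crosslink sites into regions with the columns listed above. It is not PureCLIP's implementation. It assumes the input is sorted by chromosome, strand and start, that distance is measured between the start positions of consecutive sites, and that only sites on the same chromosome and strand are merged. The helper name merge_sites and the sample sites are illustrative:

```shell
# Hypothetical helper: merge consecutive sites from stdin whose starts are <= $1 bp apart.
merge_sites() {
  awk -v d="$1" -F '\t' 'BEGIN { OFS = "\t" }
    # Write out the region collected so far, if any.
    function emit() { if (have) print chr, from, to, scores, sum, strand }
    {
      if (have && $1 == chr && $6 == strand && $2 - last <= d) {
        # Extend the current region by this site.
        to = $3; scores = scores $5 ";"; sum += $5
      } else {
        # Close the previous region and start a new one.
        emit()
        chr = $1; strand = $6; from = $2; to = $3
        scores = $5 ";"; sum = $5 + 0; have = 1
      }
      last = $2
    }
    END { emit() }'
}

# Three made-up sites: the first two are 5 bp apart and are merged, the third stays alone.
printf 'chr1\t100\t101\t3\t2.5\t+\nchr1\t105\t106\t3\t1.5\t+\nchr1\t200\t201\t3\t1\t+\n' | merge_sites 8
```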

User options

For a full list of user options, please type:

pureclip --help