Sabrina Krakau edited this page Jun 4, 2017 · 41 revisions

Welcome to the PureCLIP wiki!

PureCLIP is a tool to detect protein-RNA interaction footprints from single-nucleotide CLIP-seq data, such as iCLIP and eCLIP.

Before you start, please check the

and if necessary

Preparing sample files for minimal example

As a first example you can download preprocessed data from ENCODE, and filter the paired-end data to keep only R2:

wget -O pum2.aligned.prepro.bam https://www.encodeproject.org/files/ENCFF280ONP/@@download/ENCFF280ONP.bam
samtools view -hb -f 130 pum2.aligned.prepro.bam -o pum2.aligned.prepro.R2.bam
samtools index pum2.aligned.prepro.R2.bam
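
The value given to -f is a bitwise combination of SAM flags. As a quick sanity check, the following sketch decodes such a value using the bit values from the SAM specification (2: read mapped in proper pair, 64: first in pair, 128: second in pair). The helper name decode_flag is purely illustrative and not part of samtools or PureCLIP.

```shell
#!/bin/sh
# Decode the pairing-related bits of a SAM flag value.
# Bit values follow the SAM specification; decode_flag is a hypothetical helper.
decode_flag() {
    flag=$1
    out=""
    [ $(( flag & 2 )) -ne 0 ]   && out="${out}PROPER_PAIR "
    [ $(( flag & 64 )) -ne 0 ]  && out="${out}READ1 "
    [ $(( flag & 128 )) -ne 0 ] && out="${out}READ2 "
    # Strip the trailing space before printing.
    printf '%s\n' "${out% }"
}

decode_flag 130   # prints: PROPER_PAIR READ2
```

So -f 130 keeps reads that are both properly paired and second in pair, i.e. R2. Keeping properly paired R1 reads instead would correspond to -f 66 (2 + 64).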

As a reference genome, you can download e.g. the FASTA file from ENSEMBL:

wget -O ref.GRCh38.fa.gz ftp://ftp.ensembl.org/pub/release-77/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz 
gunzip ref.GRCh38.fa.gz

PureCLIP in basic mode

PureCLIP expects aligned reads as input. More precisely, it assumes that the input contains only reads carrying information about potential truncation events: R1 for iCLIP data and R2 for eCLIP data. To run in basic mode, PureCLIP requires the BAM and BAI files, the reference genome, and an output file:

pureclip -i pum2.aligned.prepro.R2.bam -bai pum2.aligned.prepro.R2.bam.bai -g ref.GRCh38.fa -iv '1;2;3;' -nt 10 -o PureCLIP.crosslink_sites.bed

With -iv you can specify the chromosomes (or transcripts) that are used to learn the parameters of PureCLIP's HMM. This reduces memory consumption and runtime. Usually, learning on a small subset of the chromosomes, e.g. Chr1-3, does not noticeably impair the results. However, for very sparse data the subset can be enlarged. With -nt you can specify the number of threads used for parallelization.
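
When the list of chromosomes is longer or generated by a script, the value for -iv can be assembled programmatically. The sketch below joins names into the semicolon-terminated form used in the command above. make_iv is a hypothetical helper, not part of PureCLIP, and it assumes that the names you pass match the sequence names used in your BAM file and reference.

```shell
#!/bin/sh
# Join sequence names into the semicolon-terminated list used for -iv.
# make_iv is a hypothetical helper, not part of PureCLIP.
make_iv() {
    iv=""
    for name in "$@"; do
        iv="${iv}${name};"
    done
    printf '%s\n' "$iv"
}

IV=$(make_iv 1 2 3)
echo "$IV"   # prints: 1;2;3;
```

The result can then be passed on the command line as -iv "$IV".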

PureCLIP incorporating input control data

To run PureCLIP with input control data, additionally pass the (preprocessed) BAM file from the input experiment with -ibam and the associated BAI file with -ibai:

pureclip -i pum2.aligned.prepro.R2.bam -bai pum2.aligned.prepro.R2.bam.bai -g ref.GRCh38.fa -iv '1;2;3;' -nt 10 -o PureCLIP.crosslink_sites.cov_inputSignal.bed -g1g2k -ibam pum2_input.prepro.R2.bam -ibai pum2_input.prepro.R2.bam.bai

The parameter -g1g2k constrains the shape parameter of the second gamma distribution to be less than or equal to the shape parameter of the first gamma distribution.

PureCLIP incorporating CL-motif scores

To incorporate CL-motifs into PureCLIP's model, we first need to compute position-wise CL-motif scores, which indicate each position's CL-affinity. You can use the provided precompiled list of common_CL-motifs. If you want to compute CL-motifs specific to the eCLIP experiment at hand, please have a look here. Given a set of CL-motifs, we use FIMO (Grant et al., 2011) to compute motif occurrences, each associated with a score, within the reference. The following script first retrieves the reference regions covered by the target experiment, then runs FIMO to compute position-wise CL-motif match scores, and finally chooses for each position the motif with the highest score:

export BEDTOOLS=/path/to/bedtools        # if not specified, PATH is searched
export FIMO=/path/to/fimo                # if not specified, PATH is searched
export WINEXTRACT=/path/to/winextract    # built together with PureCLIP
compute_CLmotif_scores.sh ref.GRCh38.fa pum2.aligned.prepro.R2.bam motifs.xml motifs.txt fimo_clmotif_occurences.bed

The computed scores are then passed to PureCLIP with -fis, together with the parameter -nim 4, which indicates that scores with associated motif IDs 1-4 will be used (default: only scores with motif ID 1 will be used).

pureclip -i pum2.aligned.prepro.R2.bam -bai pum2.aligned.prepro.R2.bam.bai -g ref.GRCh38.fa -o PureCLIP.crosslink_sites.cov_CLmotifs.bed -nt 10 -iv '1;2;3;' -nim 4 -fis fimo_clmotif_occurences.bed

PureCLIP's output

The main output of PureCLIP is a BED6 file, containing individual crosslink sites together with a score:

  1. chr: Name of the chromosome or scaffold.
  2. start: Position of crosslink site.
  3. end: Position behind crosslink site (start+1).
  4. state: '3'
  5. score: Log posterior probability ratio of the most likely and second most likely state.
  6. strand: + or -
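
Because the score is stored in column 5, standard command-line tools can be used to filter or rank the called sites. The following is a minimal sketch on made-up records in the layout described above; the file names and the threshold of 10 are arbitrary illustrations, not recommended values.

```shell
#!/bin/sh
# Work in a temporary directory so no existing files are touched.
bed_tmp=$(mktemp -d)

# Illustrative BED6 records (tab-separated; all values are made up).
printf '1\t1000\t1001\t3\t12.5\t+\n'  > "$bed_tmp/sites.bed"
printf '1\t1050\t1051\t3\t3.2\t-\n'  >> "$bed_tmp/sites.bed"
printf '2\t500\t501\t3\t25.1\t+\n'   >> "$bed_tmp/sites.bed"

# Keep sites with score >= MIN_SCORE (column 5),
# then sort by score with the highest-scoring site first.
MIN_SCORE=10
awk -v min="$MIN_SCORE" '$5 >= min' "$bed_tmp/sites.bed" \
    | sort -k5,5nr > "$bed_tmp/top_sites.bed"

cat "$bed_tmp/top_sites.bed"
```

With these three records, the site with score 3.2 is dropped and the remaining two are listed in descending score order.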

Optionally, if an output file for binding regions is specified with -or, individual crosslink sites with a distance <= d (specified with -dm, default 8 bp) are merged and written to a separate BED6 file:

  1. chr: Name of the chromosome or scaffold.
  2. start: Start position, position of first crosslink site.
  3. end: End position, position behind last crosslink site.
  4. indiv. scores: 'score1;score2;score3;'
  5. score: Sum of log posterior probability ratio scores.
  6. strand: + or -
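
To illustrate how the columns of the region file relate to the individual sites, the following sketch merges made-up crosslink sites with awk. It is not PureCLIP's implementation: it assumes that the input is sorted by chromosome, strand and start, that only sites on the same chromosome and strand are merged, and that the distance is measured between the start positions of consecutive sites.

```shell
#!/bin/sh
reg_tmp=$(mktemp -d)

# Illustrative crosslink sites (tab-separated; all values are made up).
printf '1\t100\t101\t3\t5.0\t+\n'  > "$reg_tmp/sites.bed"
printf '1\t104\t105\t3\t2.5\t+\n' >> "$reg_tmp/sites.bed"
printf '1\t120\t121\t3\t4.0\t+\n' >> "$reg_tmp/sites.bed"

# Merge consecutive sites whose start positions are at most d apart.
awk -v d=8 '
BEGIN { OFS = "\t" }
# Print the region collected so far (if any).
function emit() { if (n) print chr, start, end, scores, sum, strand }
{
    if (n && $1 == chr && $6 == strand && $2 - last <= d) {
        # Extend the current region and accumulate its scores.
        end = $3; scores = scores $5 ";"; sum += $5
    } else {
        # Close the previous region and start a new one.
        emit()
        chr = $1; start = $2; end = $3; strand = $6
        scores = $5 ";"; sum = $5 + 0
    }
    last = $2; n++
}
END { emit() }' "$reg_tmp/sites.bed" > "$reg_tmp/regions.bed"

cat "$reg_tmp/regions.bed"
```

With these three sites, the first two (4 bp apart) are merged into one region spanning 100 to 105 with the scores 5.0;2.5; and a summed score of 7.5, while the third site (16 bp further on) forms a region of its own.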