Home
Welcome to the PureCLIP wiki!
PureCLIP is a tool to detect protein-RNA interaction footprints from single-nucleotide CLIP-seq data, such as iCLIP and eCLIP.
Before you start, please check the
and if necessary
As a first example you can download preprocessed data from ENCODE, and filter the paired-end data to keep only R2:
wget -O pum2.aligned.prepro.bam https://www.encodeproject.org/files/ENCFF280ONP/@@download/ENCFF280ONP.bam
samtools view -hb -f 130 pum2.aligned.prepro.bam -o pum2.aligned.prepro.R2.bam
samtools index pum2.aligned.prepro.R2.bam
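The value 130 passed to -f is a SAM flag bitmask: 2 (read mapped in a proper pair) plus 128 (second read in the pair, i.e. R2). samtools view -f keeps only reads that have all of these bits set. As a small illustration, the following shell sketch performs the same bit test on a single flag value. Note that has_all_flags is a hypothetical helper written for this example, not part of samtools or PureCLIP:

```shell
# Hypothetical helper: succeed if every bit of MASK is set in FLAG
# (the same kind of test that `samtools view -f MASK` applies per read).
has_all_flags() {
    local flag=$1 mask=$2
    [ $(( flag & mask )) -eq "$mask" ]
}

# 130 = 2 (proper pair) + 128 (second in pair)
# 163 = 1 (paired) + 2 (proper pair) + 32 (mate reverse) + 128 (second in pair)
has_all_flags 163 130 && echo "flag 163 passes -f 130"
```

A flag such as 99 (first in pair) lacks bit 128 and would therefore be dropped by the filter.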
As a reference genome, you can download, for example, the FASTA file from Ensembl:
wget -O ref.GRCh38.fa.gz ftp://ftp.ensembl.org/pub/release-77/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip ref.GRCh38.fa.gz
PureCLIP expects aligned reads as input. More precisely, it assumes that the input contains only reads carrying information about potential truncation events: R1 for iCLIP data and R2 for eCLIP data. In basic mode, PureCLIP requires the BAM and BAI files, the reference genome and an output file:
pureclip -i pum2.aligned.prepro.R2.bam -bai pum2.aligned.prepro.R2.bam.bai -g ref.GRCh38.fa -iv '1;2;3;' -nt 10 -o PureCLIP.crosslink_sites.bed
With -iv you can specify the chromosomes (or transcripts) that will be used to learn the parameters of PureCLIP's HMM. This reduces memory consumption and runtime. Usually, learning on a small subset of the chromosomes, e.g. Chr1-3, does not noticeably impair the results. However, for very sparse data the subset can be extended to include more chromosomes.
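The value given to -iv in the example above is a list of sequence names, each followed by a semicolon. If the list is generated in a script, it can be assembled as in the following sketch. build_iv is a hypothetical helper for illustration only, and it is an assumption here that the names have to match the sequence names used in the BAM file and the reference (e.g. 1 rather than chr1 for Ensembl references):

```shell
# Hypothetical helper: join sequence names into the 'name1;name2;...;' form
# used for -iv in the example call above.
build_iv() {
    local out="" name
    for name in "$@"; do
        out="${out}${name};"
    done
    printf '%s\n' "$out"
}

build_iv 1 2 3   # prints 1;2;3;
```

The result can then be passed on quoted, e.g. -iv "$(build_iv 1 2 3)", so that the shell does not interpret the semicolons.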
With -nt you can specify the number of threads used for parallelization.
To run PureCLIP with input control data, additionally hand over the (preprocessed) BAM file from the input experiment with -ibam and the associated BAI file with -ibai:
pureclip -i pum2.aligned.prepro.R2.bam -bai pum2.aligned.prepro.R2.bam.bai -g ref.GRCh38.fa -iv '1;2;3;' -nt 10 -o PureCLIP.crosslink_sites.cov_inputSignal.bed -g1g2k -ibam pum2_input.prepro.R2.bam -ibai pum2_input.prepro.R2.bam.bai
The parameter -g1g2k constrains the shape parameter of the second gamma distribution to be less than or equal to the shape parameter of the first gamma distribution.
To incorporate CL-motifs into the PureCLIP model, we first need to compute position-wise CL-motif scores, which indicate each position's CL-affinity. You can use the provided precompiled list of common_CL-motifs. If you want to compute the CL-motifs specific to the eCLIP experiment used, please have a look here. Given a set of CL-motifs, we use FIMO (Grant et al., 2011) to compute motif occurrences, each associated with a score, within the reference. The following script first retrieves the reference regions covered by the target experiment, then runs FIMO to compute position-wise CL-motif match scores, and finally chooses for each position the motif with the highest score:
export BEDTOOLS=/path/to/bedtools # if not specified, PATH is searched
export FIMO=/path/to/fimo # if not specified, PATH is searched
export WINEXTRACT=/path/to/winextract # built together with PureCLIP
compute_CLmotif_scores.sh ref.fa pum2.aligned.prepro.R2.bam motifs.xml motifs.txt fimo_clmotif_occurences.bed
The computed scores are then handed over to PureCLIP with -fis, together with the parameter -nim 4, which indicates that scores with associated motif IDs 1-4 will be used (default: only scores with motif ID 1 will be used):
pureclip -i pum2.aligned.prepro.R2.bam -bai pum2.aligned.prepro.R2.bam.bai -g ref.GRCh38.fa -o PureCLIP.crosslink_sites.cov_CLmotifs.bed -nt 10 -iv '1;2;3;' -nim 4 -fis fimo_clmotif_occurences.bed
The main output of PureCLIP is a BED6 file, containing individual crosslink sites together with a score:
- chr: Name of the chromosome or scaffold.
- start: Position of crosslink site.
- end: Position after the crosslink site (start+1).
- state: '3'
- score: Log posterior probability ratio of the most likely and the second most likely state.
- strand: + or -
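Since the score is stored in the fifth column, the sites can be post-processed with standard tools. The following is a minimal sketch that keeps only sites with a score at or above a chosen threshold. filter_sites_by_score is a hypothetical helper, the file is assumed to be tab-separated as is usual for BED, and any threshold value is a user choice rather than a PureCLIP recommendation:

```shell
# Hypothetical helper: read BED6 crosslink sites from stdin and print only
# those whose score (column 5) is >= the threshold given as first argument.
# Assumes tab-separated columns: chr, start, end, state, score, strand.
filter_sites_by_score() {
    local min_score=$1
    awk -F'\t' -v min="$min_score" '$5 + 0 >= min + 0'
}
```

For example, filter_sites_by_score 1.0 < PureCLIP.crosslink_sites.bed > PureCLIP.crosslink_sites.filtered.bed would write the retained sites to a new file (the output file name and the value 1.0 are placeholders).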
Optionally, if an output file for binding regions is specified with -or, individual crosslink sites within a distance <= d of each other (specified with -dm, default 8 bp) are merged and written to a separate BED6 file:
- chr: Name of the chromosome or scaffold.
- start: Start position, position of first crosslink site.
- end: End position, position after the last crosslink site.
- indiv. scores: 'score1;score2;score3;'
- score: Sum of log posterior probability ratio scores.
- strand: + or -
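To make the relation between the two output files more concrete, the merging step can be sketched roughly as follows. This is an illustration under stated assumptions, not PureCLIP's actual implementation: merge_sites is a hypothetical helper, the distance is taken between the start positions of consecutive sites, and the input is assumed to be tab-separated BED6 sorted by chromosome, strand and start position (e.g. with sort -k1,1 -k6,6 -k2,2n).

```shell
# Hypothetical sketch: merge crosslink sites (BED6 on stdin) into regions.
# Consecutive sites on the same chromosome and strand are joined when their
# start positions are at most max_dist apart (default 8).
merge_sites() {
    local max_dist=${1:-8}
    awk -F'\t' -v OFS='\t' -v d="$max_dist" '
        # Print the region collected so far, if any.
        function emit() {
            if (n > 0) print chr, start, stop, scores, sum, strand
        }
        {
            # Open a new region on a chromosome or strand change,
            # or when the gap to the previous site exceeds d.
            if (n == 0 || $1 != chr || $6 != strand || $2 - last > d) {
                emit()
                chr = $1; start = $2; strand = $6
                scores = ""; sum = 0; n = 0
            }
            stop = $3; last = $2
            scores = scores $5 ";"   # individual scores: score1;score2;...
            sum += $5                # region score: sum of site scores
            n++
        }
        END { emit() }
    '
}
```

Each output line then follows the column layout listed above: region coordinates, the individual scores joined by semicolons, their sum, and the strand.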