Home
Welcome to the PureCLIP wiki!
PureCLIP is a tool to detect protein-RNA interaction footprints from single-nucleotide CLIP-seq data, such as iCLIP and eCLIP.
Before you start, please check
We use the PUM2 eCLIP data from ENCODE (Van Nostrand et al., 2016), preprocessed as described in the previous step.
Alternatively, you can download the preprocessed ENCODE data and filter the paired-end data to keep only R2:
samtools view -hb -f 130 pum2.aligned.prepro.bam -o pum2.aligned.prepro.R2.bam
samtools index pum2.aligned.prepro.R2.bam
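The value 130 passed to -f is the sum of two SAM flag bits: 2 (read mapped in a proper pair) and 128 (second read in the pair), so only properly paired R2 reads are kept. As a minimal sketch of that bit test in plain shell arithmetic (the helper name has_all_bits is hypothetical and not part of samtools or PureCLIP):

```shell
# Hypothetical helper: succeeds if every bit of MASK ($2) is set in FLAG ($1).
# This mirrors the test that "samtools view -f MASK" applies to each read.
has_all_bits() {
    [ $(( $1 & $2 )) -eq "$2" ]
}

# 130 = 2 (proper pair) + 128 (second in pair)
has_all_bits 163 130 && echo "163: kept"       # 163 = 1+2+32+128, a typical R2 flag
has_all_bits 99 130 || echo "99: filtered"     # 99 = 1+2+32+64, a typical R1 flag
```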
As a reference genome, you can download e.g. the FASTA file from ENSEMBL:
wget ftp://ftp.ensembl.org/pub/release-77/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz -O ref.GRCh38.fa.gz
gunzip ref.GRCh38.fa.gz
PureCLIP expects aligned reads as input. More precisely, it assumes that only reads carrying information about potential truncation events are provided: R1 for iCLIP data and R2 for eCLIP data. In basic mode, PureCLIP requires the BAM and BAI files, the reference genome, and an output file:
pureclip -i pum2.aligned.prepro.R2.bam -bai pum2.aligned.prepro.R2.bam.bai -g ref.GRCh38.fa -iv '1;2;3;' -nt 10 -o PureCLIP.crosslink_sites.bed
With -iv you can specify the chromosomes (or transcripts) that are used to learn the parameters of PureCLIP's HMM. This reduces memory consumption and runtime. Usually, learning on a small subset of the chromosomes, e.g. Chr1-3, does not noticeably impair the results. However, for very sparse data the subset can be adjusted. With -nt you can specify the number of threads used for parallelization.
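The value given to -iv in the command above is a list of contig names, each followed by a semicolon ('1;2;3;'). A small sketch for building such a string from a list of names follows. The helper name make_iv is hypothetical, and it is an assumption here that the names have to match the contig names used in the BAM file and reference (for example 1 for Ensembl-style naming rather than chr1):

```shell
# Hypothetical helper: join contig names into the 'name;name;' form
# used for -iv in the examples on this page.
make_iv() {
    printf '%s;' "$@"
}

iv=$(make_iv 1 2 3)
echo "$iv"
# The result could then be passed on, e.g.: pureclip ... -iv "$iv" ...
```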
To run PureCLIP with input control data, additionally hand over the (preprocessed) BAM file from the input experiment with -ibam and the associated BAI file with -ibai:
pureclip -i pum2.aligned.prepro.R2.bam -bai pum2.aligned.prepro.R2.bam.bai -g ref.GRCh38.fa -o PureCLIP.crosslink_sites.cov_inputSignal.bed -nt 10 -iv '1;2;3;' -g1g2k -ibam input_pum2.prepro.R2.bam -ibai input_pum2.prepro.R2.bam.bai
To incorporate CL-motifs into the model of PureCLIP, we first need to compute position-wise CL-motif scores, which indicate each position's CL-affinity. You can use the provided precompiled list of common_CL-motifs. If you want to compute the CL-motifs specific to the used eCLIP experiment, please have a look here. Given a set of CL-motifs, we use FIMO (Grant et al., 2011) to compute motif occurrences, each associated with a score, within the reference. The following script first retrieves the reference regions covered by the target experiment, then runs FIMO to compute position-wise CL-motif match scores, and finally chooses for each position the motif with the highest score:
export BEDTOOLS=/path/to/bedtools # if not specified, PATH is searched
export FIMO=/path/to/fimo # if not specified, PATH is searched
export WINEXTRACT=/path/to/winextract # built together with PureCLIP
compute_CLmotif_scores.sh ref.fa pum2.aligned.prepro.R2.bam motifs.xml motifs.txt fimo_clmotif_occurences.bed
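The last step of the script, choosing the highest-scoring motif for each position, can be illustrated with a short sketch. This is not the code of compute_CLmotif_scores.sh. It assumes a tab-separated BED6-like input with the motif ID in column 4 and the match score in column 5, and the function name best_per_position is hypothetical:

```shell
# Illustrative sketch only: for each (chromosome, start, end, strand),
# keep the motif ID and score of the highest-scoring match.
# Assumed input columns: chrom, start, end, motif ID, score, strand.
best_per_position() {
    awk 'BEGIN { FS = OFS = "\t" }
        {
            key = $1 FS $2 FS $3 FS $6
            if (!(key in best) || $5 + 0 > best[key]) {
                best[key] = $5 + 0   # numeric score used for comparison
                sc[key] = $5         # score text as it appeared in the input
                id[key] = $4         # motif ID of the best match so far
            }
        }
        END {
            for (k in best) {
                split(k, f, FS)
                print f[1], f[2], f[3], id[k], sc[k], f[4]
            }
        }' |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n -k6,6
}

# Made-up example: two motifs match position 100, one matches position 200.
printf '1\t100\t101\t1\t5.2\t+\n1\t100\t101\t3\t7.9\t+\n1\t200\t201\t2\t4.0\t-\n' |
    best_per_position
```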
The computed scores are then handed over to PureCLIP with -fis, together with the parameter -nim 4, indicating that scores with associated motif IDs 1-4 will be used (default: only scores with motif ID 1 will be used).
pureclip -i pum2.aligned.prepro.R2.bam -bai pum2.aligned.prepro.R2.bam.bai -g ref.GRCh38.fa -o PureCLIP.crosslink_sites.cov_CLmotifs.bed -nt 10 -iv '1;2;3;' -nim 4 -fis fimo_clmotif_occurences.bed
The main output of PureCLIP is a BED6 file, containing individual crosslink sites together with a score:
chromosome, start, (start+1), (state=3), score, strand
Optionally, if an output file for binding regions is specified with --or, individual crosslink sites with a distance <= d (specified with --dm) are merged and written to a separate BED6 file:
chromosome, start, end, 'score1;score2;score3;', score, strand
where the 4th column contains the individual crosslink scores, while the 5th column is the sum of these scores.
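To make the relation between the two output formats concrete, the following sketch merges BED6 crosslink sites into regions in the format described above. It is an illustration, not PureCLIP's implementation. It assumes that merging is strand-specific and that the distance is measured between the start coordinates of consecutive sites; the function name merge_sites and the example data are made up:

```shell
# Illustrative sketch only: merge crosslink sites (BED6 on stdin) whose
# start coordinates are at most d apart on the same chromosome and strand.
# Usage: merge_sites D < sites.bed
merge_sites() {
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k6,6 -k2,2n |
    awk -v d="$1" 'BEGIN { FS = OFS = "\t" }
        # Print the region collected so far, if any.
        function emit() { if (n) print chr, s, e, list, sum, strand }
        {
            if (n && $1 == chr && $6 == strand && $2 - last <= d) {
                # Close enough to the previous site: extend the region.
                e = $3; list = list $5 ";"; sum += $5
            } else {
                # Otherwise write out the previous region and start a new one.
                emit()
                chr = $1; s = $2; e = $3; strand = $6
                list = $5 ";"; sum = $5 + 0
            }
            last = $2; n++
        }
        END { emit() }'
}

# Made-up example: the first two sites are 3 nt apart and get merged for d = 8.
printf '1\t100\t101\t3\t2.5\t+\n1\t103\t104\t3\t1.5\t+\n1\t120\t121\t3\t4.0\t+\n' |
    merge_sites 8
```

In each output line, column 4 lists the scores of the merged sites and column 5 holds their sum, so the two columns can be checked against each other.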