Sabrina Krakau edited this page Jul 20, 2018 · 19 revisions

Welcome to the sim_iCLIP wiki! (updated)

This tutorial describes how to simulate iCLIP/eCLIP data starting from real RNA-seq data and bona fide protein binding regions. Additionally, non-specific binding of background proteins can be simulated using published background binding regions, and random noise can be added from RNA-seq data.

Simulation of target-specific iCLIP-seq data

To simulate the data, we use aligned total RNA-seq data from ENCODE (ENCSR885DVH), the corresponding reference sequence, and bona fide binding regions, for which we use PUM2 motif occurrences:

wget -O rna-seq.rep1.bam https://www.encodeproject.org/files/ENCFF131IST/@@download/ENCFF131IST.bam
wget -O rna-seq.rep2.bam https://www.encodeproject.org/files/ENCFF726SMY/@@download/ENCFF726SMY.bam
samtools merge -f rna-seq.bam rna-seq.rep1.bam rna-seq.rep2.bam

samtools view -hb -f 130 rna-seq.bam -o rna-seq.R2.bam    # keep only second reads (R2) of properly paired fragments
samtools index rna-seq.R2.bam
wget -O ref.hg19.fa.gz https://www.encodeproject.org/files/female.hg19/@@download/female.hg19.fasta.gz 
gunzip ref.hg19.fa.gz
wget -O pum2_motif_matches.bed https://raw.githubusercontent.com/skrakau/PureCLIP_data/master/pum2_motif_matches/fimo.thresh0.01.intersectTranscript.d100.bed
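
The value passed to `-f` in the `samtools view` call above is a sum of SAM flag bits. As a quick check of the arithmetic, the following sketch composes it from the two bits defined in the SAM specification (0x2, read mapped in a proper pair; 0x80, second read in pair). The variable names are illustrative only and are not used by samtools or sim_iclip.

```shell
# SAM flag bits, as defined in the SAM specification
PROPER_PAIR=$((0x2))      # read mapped in a proper pair
SECOND_IN_PAIR=$((0x80))  # second read of the pair (R2)

# samtools view -f keeps only alignments that have ALL of these bits set
REQUIRED_FLAGS=$((PROPER_PAIR | SECOND_IN_PAIR))
echo "$REQUIRED_FLAGS"    # prints 130
```

Because both bits are required, second reads whose pair is not flagged as properly paired are dropped as well.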

Then we run the simulation with:

sim_iclip -nt 10 -bam rna-seq.R2.bam -bai rna-seq.R2.bam.bai -ref ref.hg19.fa -bs pum2_motif_matches.bed -urc -fld examples/normal_mean165_sd50_tp20.txt -out sim_iCLIP.bam

If no fixed fragment length is used for the simulation, a file containing the fragment length distribution can be passed with -fld (see examples/). The -urc option causes the number of simulated crosslink sites within each binding region to be drawn from a uniform distribution instead of being fixed.

Simulation of iCLIP-seq data containing target-specific and background binding

To additionally simulate non-specific signal from background proteins, we use a published list of common background regions (Reyes-Herrera et al., 2015) and pass it with -bbs:

wget -O bg_regions.bed https://github.com/phrh/BackCLIP/raw/master/CommonBackground/BackgroundTraining_19datasets.bed

sim_iclip -nt 10 -bam rna-seq.R2.bam -bai rna-seq.R2.bam.bai -ref ref.hg19.fa -bs pum2_motif_matches.bed -urc -fld examples/normal_mean165_sd50_tp20.txt -bbs bg_regions.bed -out sim_iCLIP.bam
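
Since both the target and the background regions are passed as BED files, it can be worth checking them before a long simulation run. The following is a minimal sketch, assuming standard tab-separated BED with chromosome, start and end in the first three columns; `check_bed` is a hypothetical helper and not part of sim_iclip.

```shell
# check_bed FILE: hypothetical helper, not part of sim_iclip.
# Prints the number of suspicious lines: fewer than 3 tab-separated
# columns, non-integer coordinates, or an empty/inverted interval
# (start >= end).
check_bed() {
    awk -F'\t' '
        /^(#|track|browser)/ { next }   # skip comment and header lines
        NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 >= $3 + 0 { bad++ }
        END { print bad + 0 }
    ' "$1"
}

# Report on the two BED files used in this tutorial, if present
for f in pum2_motif_matches.bed bg_regions.bed; do
    if [ -f "$f" ]; then
        echo "$f: $(check_bed "$f") suspicious line(s)"
    fi
done
```

A count of 0 only means that the coordinates look plausible. It does not verify that the chromosome names match those in the reference FASTA.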

Adding random noise from RNA-seq data

samtools view -hb -s 0.01 rna-seq.R2.bam > random_noise.bam    # subsample about 1% of the reads (seed 0)
samtools merge -f sim_iCLIP.with_noise.bam sim_iCLIP.bam random_noise.bam    # combine simulated signal and noise
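
In `samtools view -s`, the integer part of the argument seeds the random number generator and the part after the decimal point is the fraction of reads to keep. To draw a different but reproducible noise sample, the seed can be changed. A minimal sketch, where `subsample_arg` is a hypothetical helper and the seed 42 and the output file name are arbitrary choices:

```shell
# subsample_arg SEED FRACTION: hypothetical helper that builds the
# SEED.FRACTION value expected by `samtools view -s`.
# FRACTION must be written with a leading zero, e.g. 0.01.
subsample_arg() {
    seed=$1
    frac=$2
    # drop the leading "0" of the fraction, e.g. 0.01 -> .01
    printf '%s%s\n' "$seed" "${frac#0}"
}

subsample_arg 42 0.01   # prints 42.01

# Example use (assumes samtools and the files created above):
# samtools view -hb -s "$(subsample_arg 42 0.01)" rna-seq.R2.bam > random_noise.seed42.bam
```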

User options

For a full list of user options, please type:

sim_iclip --help

Examples can be found here.