Home
Welcome to the sim_iCLIP wiki! (updated)
In the following tutorial we will describe how to simulate iCLIP/eCLIP data starting from real RNA-seq data and bona fide protein binding regions. Additionally, non-specific binding of background proteins can be simulated using published background binding regions and random noise from RNA-seq data.
To simulate the data, we use aligned total RNA-seq data from ENCODE (ENCSR885DVH), the corresponding reference sequence, and a set of bona fide binding regions, for which we use PUM2 motif occurrences here:
wget -O rna-seq.rep1.bam https://www.encodeproject.org/files/ENCFF131IST/@@download/ENCFF131IST.bam
wget -O rna-seq.rep2.bam https://www.encodeproject.org/files/ENCFF726SMY/@@download/ENCFF726SMY.bam
samtools merge -f rna-seq.bam rna-seq.rep1.bam rna-seq.rep2.bam
samtools view -hb -f 130 rna-seq.bam -o rna-seq.R2.bam # only use the second read of properly paired reads (flag 130 = 128 + 2)
samtools index rna-seq.R2.bam
wget -O ref.hg19.fa.gz https://www.encodeproject.org/files/female.hg19/@@download/female.hg19.fasta.gz
gunzip ref.hg19.fa.gz
wget -O pum2_motif_matches.bed https://raw.githubusercontent.com/skrakau/PureCLIP_data/master/pum2_motif_matches/fimo.thresh0.01.intersectTranscript.d100.bed
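The `-f 130` filter used above combines two SAM flag bits: 2 (read mapped in a proper pair) and 128 (second read in the pair). The following is a minimal sketch in plain shell arithmetic that decodes these two bits from a flag value. `describe_flag` is a hypothetical helper name introduced only for illustration; it is not part of samtools or sim_iclip, and it ignores all other flag bits.

```shell
# describe_flag FLAG: report which of the two bits required by `-f 130`
# are set in a SAM flag value. Hypothetical helper for illustration only.
describe_flag() {
    flag=$1
    out=""
    # bit 2: read is mapped in a proper pair
    if [ $(( flag & 2 )) -ne 0 ]; then
        out="${out}proper_pair "
    fi
    # bit 128: read is the second read in its pair
    if [ $(( flag & 128 )) -ne 0 ]; then
        out="${out}second_in_pair "
    fi
    # strip the trailing space before printing
    echo "${out% }"
}

describe_flag 130   # prints: proper_pair second_in_pair
describe_flag 99    # prints: proper_pair
```

A read passes `-f 130` only if both bits are set, so a first-in-pair read such as one with flag 99 is filtered out.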
Then we run the simulation with:
sim_iclip -nt 10 -bam rna-seq.R2.bam -bai rna-seq.R2.bam.bai -ref ref.hg19.fa -bs pum2_motif_matches.bed -urc -fld examples/normal_mean165_sd50_tp20.txt -out sim_iCLIP.bam
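Before launching a longer run, it can help to confirm that every input file named in the command exists and is non-empty. The sketch below is one way to do that in plain shell; `check_inputs` is a hypothetical helper, not a feature of sim_iclip, and the file names are simply those used in this tutorial.

```shell
# check_inputs FILE...: print each argument that is missing or empty to
# stderr and return non-zero if any file failed the check.
# Hypothetical helper for illustration only.
check_inputs() {
    status=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return $status
}

# File names are those used in the commands of this tutorial.
if check_inputs rna-seq.R2.bam rna-seq.R2.bam.bai ref.hg19.fa \
        pum2_motif_matches.bed examples/normal_mean165_sd50_tp20.txt; then
    echo "all inputs present"
fi
```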
Currently, if no fixed fragment length is used for simulation, a file containing the fragment length distribution can be handed over with -fld (see examples/).
The parameter -urc causes the number of simulated crosslink sites within one binding region to be drawn from a uniform distribution, instead of using a fixed number.
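To illustrate what drawing a count from a uniform distribution means, here is a small sketch using awk. It is not sim_iclip's implementation: `draw_uniform_int` is a hypothetical helper, and the bounds 1 and 5 as well as the seeds are placeholders, since the range used by sim_iclip is not stated here.

```shell
# draw_uniform_int LO HI SEED: print one integer drawn uniformly from
# the closed range [LO, HI]. Hypothetical helper; LO and HI are
# placeholders, not the bounds used by sim_iclip.
draw_uniform_int() {
    awk -v lo="$1" -v hi="$2" -v seed="$3" 'BEGIN {
        srand(seed)
        # rand() is in [0, 1), so the offset is an integer in [0, hi - lo]
        print lo + int(rand() * (hi - lo + 1))
    }'
}

# One draw per binding region (three hypothetical regions, seeds 1 to 3).
for region in 1 2 3; do
    echo "region $region: $(draw_uniform_int 1 5 "$region") crosslink sites"
done
```

With a fixed number, every region would receive the same count; with uniform draws, counts vary between regions within the chosen range.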
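To additionally simulate non-specific signal from background proteins, we use a published list of common background regions (Reyes-Herrera et al., 2015), passed with -bbs. Random noise is then added by subsampling 1% of the RNA-seq reads (samtools view -s 0.01) and merging them with the simulated reads: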
wget -O bg_regions.bed https://github.com/phrh/BackCLIP/raw/master/CommonBackground/BackgroundTraining_19datasets.bed
sim_iclip -nt 10 -bam rna-seq.R2.bam -bai rna-seq.R2.bam.bai -ref ref.hg19.fa -bs pum2_motif_matches.bed -urc -fld examples/normal_mean165_sd50_tp20.txt -bbs bg_regions.bed -out sim_iCLIP.bam
samtools view -hb -s 0.01 rna-seq.R2.bam > random_noise.bam
samtools merge -f sim_iCLIP.with_noise.bam sim_iCLIP.bam random_noise.bam
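In `samtools view -s 0.01`, the integer part of the argument is the random seed (0) and the fractional part is the fraction of reads to keep (1%). The sketch below estimates how many noise reads to expect for a given input size; `expected_subsample` is a hypothetical helper, and the total of 2000000 reads is a placeholder, not a measured value for this dataset.

```shell
# expected_subsample TOTAL FRACTION: expected number of reads kept when
# subsampling TOTAL reads at FRACTION, rounded to the nearest integer.
# Hypothetical helper for illustration only.
expected_subsample() {
    awk -v total="$1" -v frac="$2" 'BEGIN { printf "%d\n", total * frac + 0.5 }'
}

# Placeholder total; replace with the real read count of rna-seq.R2.bam.
expected_subsample 2000000 0.01   # prints: 20000
```

The real counts can be obtained with `samtools view -c rna-seq.R2.bam` and `samtools view -c random_noise.bam`. Because reads are retained at random, the observed number will only be approximately 1% of the total.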
For a full list of user options, please type:
sim_iclip --help
Examples can be found here.