Suggestions on Converting Transcript-Aligned BAM to Genomic-Aligned Ones? #35

YU-Zhejian · 2024-11-05T01:52:46Z

Dear PBSIM3 maintainers,

I am YU Zhejian from the Zhejiang University-University of Edinburgh Institute. Our group has been using PBSIM1/2/3 to simulate long-read RNA-Seq reads (with the YASIM simulation framework developed by our group) to benchmark long-read transcriptome assemblers, and we are now trying to use PBSIM1/2/3 to generate reads for investigating issues caused by spliced aligners. I have two questions:

1. Do you have any suggestions on software that can convert transcript-aligned BAMs (or MAFs, as generated by PBSIM1/2/3 in CLR mode) to genomic-aligned ones, given a GTF and a reference FASTA, without re-alignment and while preserving sequencing errors?
2. For simulating reads of genomic origin, it seems that you are not using HTSlib, and I am interested in the underlying reason. In my understanding, PBSIM could support larger genomes with a smaller memory footprint by using HTSlib, since it would then not need to read the entire genomic reference into memory.

Thanks for developing such an excellent tool. Any advice will be very helpful to me.

Yours sincerely,
YU Zhejian

yukiteruono · 2024-11-05T09:30:10Z

Thank you for using PBSIM3 and for your very useful comments.

Regarding the first question, it is possible to generate alignments between transcript reads and reference genome sequences with almost 100% accuracy. However, I am not sure whether it is possible to make the description of transcript coordinates strictly unique, and I was not able to fully consider transcripts such as circRNA. Therefore, the current version of PBSIM3 does not have that function.
Regarding the second question, your understanding is correct. However, even the current version of PBSIM3 can handle fairly large genomes with a moderate amount of memory.
In the future, I would like to reflect your comments in PBSIM3. However, it is difficult to do so in the near future.

YU-Zhejian · 2024-11-05T16:08:57Z

Dear Dr. Ono,

Thanks for your fast response. I suppose that, within a GTF file, a gene may appear at several locations, but I have not encountered any GTF with duplicated transcripts (identified by transcript_id, which should be unique). I wonder whether you have handled any GTFs like this. At the current stage, I may try to implement a tool on my own that converts transcriptome-aligned MAFs/BAMs to genomic-aligned BAMs, assuming unique transcript_id values and mostly protein-coding genes without circRNAs. I would appreciate any recommendations on existing tools that may accomplish this.
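To make the intended conversion concrete, here is a minimal sketch of the coordinate lifting step, not a finished tool. It assumes each transcript_id is unique and linear (no circRNA), that exons are given as 0-based half-open genomic intervals sorted by ascending coordinate, and that the alignment lies entirely within the transcript. The function names `parse_cigar` and `lift_to_genome` are hypothetical. For minus-strand transcripts the sketch only remaps the position and reverses the CIGAR; the read sequence and qualities would also need to be reverse-complemented/reversed, which is omitted here.

```python
import re
from typing import List, Tuple

REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string such as '5M1I4M' into (length, op) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def _push(out: List[Tuple[int, str]], n: int, op: str) -> None:
    """Append an operation, merging it with the previous one if identical."""
    if out and out[-1][1] == op:
        out[-1] = (out[-1][0] + n, op)
    else:
        out.append((n, op))


def lift_to_genome(
    t_pos: int, cigar: str, exons: List[Tuple[int, int]], strand: str = "+"
) -> Tuple[int, str]:
    """Lift a transcript-space alignment to genome space.

    t_pos is the 0-based alignment start on the transcript. Returns the
    0-based genomic start and a CIGAR with introns encoded as N.
    """
    ops = parse_cigar(cigar)
    if strand == "-":
        # Re-express the alignment in ascending genomic order.
        ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
        t_len = sum(end - start for start, end in exons)
        t_pos = t_len - (t_pos + ref_len)
        ops = ops[::-1]

    # Find the exon that contains the alignment start.
    idx, offset = 0, t_pos
    while idx < len(exons) and offset >= exons[idx][1] - exons[idx][0]:
        offset -= exons[idx][1] - exons[idx][0]
        idx += 1
    if t_pos < 0 or idx == len(exons):
        raise ValueError("alignment start lies outside the transcript")
    g_start = exons[idx][0] + offset
    remaining = exons[idx][1] - exons[idx][0] - offset  # bases left in this exon

    out: List[Tuple[int, str]] = []
    for n, op in ops:
        if op not in REF_CONSUMING:
            _push(out, n, op)  # insertions and clips are copied unchanged
            continue
        while n > 0:
            if remaining == 0:
                if idx + 1 >= len(exons):
                    raise ValueError("alignment runs past the transcript end")
                # Crossing an exon junction: emit the intron as an N operation.
                _push(out, exons[idx + 1][0] - exons[idx][1], "N")
                idx += 1
                remaining = exons[idx][1] - exons[idx][0]
            step = min(n, remaining)
            _push(out, step, op)
            n -= step
            remaining -= step
    return g_start, "".join(f"{n}{op}" for n, op in out)


exons = [(100, 110), (200, 210), (300, 310)]
print(lift_to_genome(5, "10M", exons))  # (105, '5M90N5M')
```

Because mismatches, insertions, and deletions are copied through unchanged and only N operations are added at junctions, the simulated sequencing errors recorded in the transcript-space alignment are preserved without re-alignment.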

By the way, I think it may be a good idea to generate BAM instead of SAM with HTSlib when simulating multi-pass reads. Unlike SAM, BAM is compressed, which is favorable when simulating scRNA-Seq data with a large number of cells at high depth (e.g., PacBio MAS-ISO-Seq data from the PacBio Revio sequencer, or SPLiT-Seq libraries constructed using Parse Biosciences). One may also benefit from the additional SAM format checks in HTSlib, which fail on incorrectly formatted reads (e.g., reads whose sequence and quality strings differ in length).
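As an illustration of the kind of record check mentioned above, the following sketch validates a single SAM line in plain Python. It is not HTSlib code and does not reproduce HTSlib's behavior exactly; the function names `query_length` and `check_sam_record` are hypothetical, and only two consistency rules from the SAM format are covered.

```python
import re
from typing import List


def query_length(cigar: str) -> int:
    """Number of read bases implied by a CIGAR string (M, I, S, =, X)."""
    return sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
               if op in "MIS=X")


def check_sam_record(line: str) -> List[str]:
    """Return a list of problems found in one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return ["fewer than 11 mandatory fields"]
    cigar, seq, qual = fields[5], fields[9], fields[10]
    problems = []
    # SEQ and QUAL must have equal length unless one of them is '*'.
    if seq != "*" and qual != "*" and len(seq) != len(qual):
        problems.append("SEQ and QUAL differ in length")
    # The CIGAR must account for every base in SEQ.
    if cigar != "*" and seq != "*" and query_length(cigar) != len(seq):
        problems.append("CIGAR and SEQ differ in length")
    return problems


good = "r1\t0\tchr1\t101\t60\t4M\t*\t0\t0\tACGT\tIIII"
bad = "r2\t0\tchr1\t101\t60\t4M\t*\t0\t0\tACGT\tIII"
print(check_sam_record(good))  # []
print(check_sam_record(bad))   # ['SEQ and QUAL differ in length']
```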

Thanks for reading this message. I am looking forward to your reply and wish you a nice day.

Yours sincerely,
YU Zhejian
