Suggestions on Converting Transcript-Aligned BAM to Genomic-Aligned Ones? #35

Open
YU-Zhejian opened this issue Nov 5, 2024 · 2 comments

Comments

@YU-Zhejian

Dear PBSIM3 maintainers,

I am YU Zhejian from the Zhejiang University-University of Edinburgh Institute. Our group has been using PBSIM1/2/3 to simulate long-read RNA-Seq reads (with the YASIM simulation framework developed by our group) to benchmark long-read transcriptome assemblers, and we are now trying to use PBSIM1/2/3 to generate reads for investigating issues caused by spliced aligners. I wonder:

  • Do you have any suggestions on software that can convert transcript-aligned BAMs (or MAFs, as generated by PBSIM1/2/3 in CLR mode) to genome-aligned ones, given a GTF and a reference FASTA, without re-alignment and while preserving sequencing errors?
  • For simulating genome-derived reads, it seems that you are not using HTSLib, and I am interested in the underlying reason. In my understanding, with HTSLib PBSIM could support larger genomes with a smaller memory footprint, since it would not need to read the entire genomic reference into memory.
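
To illustrate the memory point in the second question, here is a minimal sketch of indexed random access to a FASTA file, in the spirit of the `.fai` index used by `samtools faidx`. It is not HTSLib's API and not how PBSIM3 reads its reference; the function names `build_index` and `fetch` are hypothetical, and the sketch assumes every sequence line within a record (except possibly the last) has the same width.

```python
import io


def build_index(fh):
    """Scan a binary, seekable FASTA handle once and record, per sequence:
    length, byte offset of the first base, bases per line, bytes per line."""
    index = {}
    name = None
    fh.seek(0)
    while True:
        line = fh.readline()
        if not line:
            break
        if line.startswith(b">"):
            name = line[1:].split()[0].decode()
            index[name] = {"length": 0, "offset": fh.tell(),
                           "linebases": 0, "linewidth": 0}
        elif name is not None:
            rec = index[name]
            bases = len(line.rstrip(b"\r\n"))
            if rec["linewidth"] == 0:
                # Line geometry is taken from the first sequence line.
                rec["linebases"] = bases
                rec["linewidth"] = len(line)
            rec["length"] += bases
    return index


def fetch(fh, index, name, start, end):
    """Return bases [start, end) (0-based, half-open) by seeking,
    without loading the rest of the file."""
    rec = index[name]
    if not 0 <= start <= end <= rec["length"]:
        raise ValueError("region out of range")
    if start == end:
        return ""
    lb, lw = rec["linebases"], rec["linewidth"]
    # Byte position of a base = record offset + full lines skipped + column.
    first = rec["offset"] + (start // lb) * lw + start % lb
    last = rec["offset"] + ((end - 1) // lb) * lw + (end - 1) % lb + 1
    fh.seek(first)
    raw = fh.read(last - first)
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()


# Toy in-memory "file"; a real reference would be opened with open(path, "rb").
demo = io.BytesIO(b">chr1 desc\nACGTAC\nGTACGT\nAC\n>chr2\nTTTT\n")
demo_index = build_index(demo)
print(fetch(demo, demo_index, "chr1", 4, 9))
```

Only the small index stays in memory; each read origin can then be fetched with one seek and one short read.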

Thanks for developing such an excellent tool. Any advice will be very helpful to me.

Yours sincerely,
YU Zhejian

@yukiteruono
Owner

Thank you for using PBSIM3 and for your very useful comments.

Regarding the first question, it is possible to generate alignments between transcript reads and reference genome sequences with almost 100% accuracy. However, I am not sure whether it is possible to make the description of transcript coordinates strictly unique, and I was not able to fully consider transcripts such as circRNA. Therefore, the current version of PBSIM3 does not have that function.
Regarding the second question, your understanding is correct. However, even the current version of PBSIM3 can handle fairly large genomes with a moderate amount of memory.
I would like to reflect your comments in PBSIM3 eventually, but it will be difficult to do so in the near future.

@YU-Zhejian
Author

Dear Dr. Ono,

Thanks for your fast response. I suppose that, inside a GTF file, a gene may appear at different locations, but I haven't encountered any GTF with duplicated transcripts (identified by transcript_id, which should be unique). I wonder whether you have handled any GTFs like this. At the current stage, I may try to implement a tool on my own that converts transcriptome-aligned MAFs/BAMs to genome-aligned BAMs, assuming unique transcript_id values and mostly protein-coding genes without circRNAs. I would appreciate any recommendations on existing tools that accomplish this.
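
The core of such a conversion could look roughly like the sketch below: it projects one transcript-level alignment (start position plus CIGAR) onto the genome by splitting reference-consuming operations at exon boundaries and inserting `N` for the introns. This is only an illustration under the assumptions above (linear transcripts, exons already extracted from the GTF for a unique transcript_id); the name `project_to_genome` and its arguments are hypothetical and not part of PBSIM3 or any existing tool.

```python
import re


def parse_cigar(cigar):
    """Split a CIGAR string into (length, op) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def project_to_genome(tx_start, cigar, exons, strand="+"):
    """Project a transcript-level alignment onto the genome.

    tx_start: 0-based leftmost aligned position on the transcript.
    exons:    (start, end) genomic intervals, 0-based half-open,
              sorted by genomic start and non-overlapping.
    Returns (0-based genomic start, genomic CIGAR string).
    For '-' transcripts the caller must additionally reverse-complement
    the read, reverse its qualities, and toggle the reverse-strand flag.
    """
    ops = parse_cigar(cigar)
    if any(op == "N" for _, op in ops):
        raise ValueError("transcript-level CIGAR should not contain N")
    ref_len = sum(n for n, op in ops if op in "MD=X")
    tx_len = sum(end - start for start, end in exons)
    if ref_len == 0 or tx_start < 0 or tx_start + ref_len > tx_len:
        raise ValueError("alignment does not fit on the transcript")
    if strand == "-":
        # Re-express the alignment in genomic (left-to-right) orientation.
        tx_start = tx_len - (tx_start + ref_len)
        ops = ops[::-1]

    # Locate the exon that contains the first aligned base.
    idx, offset = 0, tx_start
    while offset >= exons[idx][1] - exons[idx][0]:
        offset -= exons[idx][1] - exons[idx][0]
        idx += 1
    gpos = exons[idx][0] + offset
    g_start = gpos

    out = []

    def emit(n, op):
        # Merge with the previous operation when it has the same type.
        if out and out[-1][1] == op:
            out[-1] = (out[-1][0] + n, op)
        else:
            out.append((n, op))

    for n, op in ops:
        if op not in "MD=X":      # I, S, H, P do not consume the reference
            emit(n, op)
            continue
        while n > 0:
            if gpos == exons[idx][1]:
                # End of exon reached: skip the intron with an N operation.
                emit(exons[idx + 1][0] - gpos, "N")
                idx += 1
                gpos = exons[idx][0]
            step = min(n, exons[idx][1] - gpos)
            emit(step, op)
            gpos += step
            n -= step
    return g_start, "".join(f"{n}{op}" for n, op in out)


# Two exons of 10 and 20 bases separated by a 90-base intron.
demo_exons = [(100, 110), (200, 220)]
print(project_to_genome(5, "3S8M1I4M", demo_exons, "+"))
```

Because mismatches, insertions, and deletions are carried over unchanged, the simulated sequencing errors are preserved. One design choice to be aware of: an insertion that falls exactly on an exon boundary is placed before the intron here, which is arbitrary.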

By the way, I think it may be a good idea to generate BAM instead of SAM with HTSLib when simulating multi-pass reads. Unlike SAM, BAM is compressed, which makes it favorable for simulating scRNA-Seq data with a large number of cells at high depth (e.g., PacBio MAS-ISO-Seq data from the PacBio Revio sequencer, or SPLiT-Seq libraries constructed with Parse Biosciences kits). One may also benefit from the format checks in HTSLib, which fail on incorrectly formatted reads (e.g., reads whose sequence and quality strings differ in length).
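
As a rough illustration of the kind of check I mean, the sketch below validates two properties of a single SAM line in plain Python. It is not HTSLib's validation logic; the function name `check_sam_record` is hypothetical, and only the sequence/quality length and the CIGAR query length are checked.

```python
import re


def check_sam_record(line):
    """Return a list of problems found in one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return ["fewer than 11 mandatory fields"]
    cigar, seq, qual = fields[5], fields[9], fields[10]
    problems = []
    # SEQ and QUAL must have equal length unless one of them is absent ('*').
    if seq != "*" and qual != "*" and len(seq) != len(qual):
        problems.append(f"SEQ length {len(seq)} != QUAL length {len(qual)}")
    # The query-consuming CIGAR operations (M, I, S, =, X) must sum to len(SEQ).
    if seq != "*" and cigar != "*":
        qlen = sum(int(n)
                   for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
                   if op in "MIS=X")
        if qlen != len(seq):
            problems.append(f"CIGAR query length {qlen} != SEQ length {len(seq)}")
    return problems


demo_line = "read1\t0\tchr1\t1\t60\t4M\t*\t0\t0\tACGT\tIII"
print(check_sam_record(demo_line))
```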

Thanks for reading this message. I am looking forward to your reply and wish you a nice day.

Yours sincerely,
YU Zhejian
