Improvement for HiFi : relate number of passes to dna fragment size #25

Sebastien-Raguideau · 2024-04-10T16:29:04Z

Hello,

Thanks for your software; it is extremely useful.

I suppose this is just a shameless feature request.

I have a slight misgiving about the number of passes and the fragment length not being related in data generation.

It is my understanding that polymerase read lengths follow a distribution that is unrelated to DNA fragment length. In effect, this means that longer DNA fragments can't go through as many passes as shorter ones, leading to quite different quality as a function of read length (high quality for short reads, low quality for long ones). That is why HiFi library preparation tends to limit the maximum DNA fragment size, so that a minimum number of passes can be ensured.

I am unsure how to use your tool to reproduce a similar pattern. I suppose I could do something silly: when I want a coverage of 30, run pbsim 30 times, each time asking for a coverage of 1 with different parameters for number of passes and fragment length. That seems hacky and not quite correct.

Would you be able to add this feature?

Best

yukiteruono · 2024-04-11T08:46:24Z

Thank you for your very interesting suggestion.

As you say, the longer the DNA fragment, the fewer the number of passes, so the longer the read, the lower the quality. Even in PBSIM3 simulations, changing the number of passes changes the quality of HiFi reads, as shown in Table S8 of the PBSIM3 paper. However, we do not have an accurate understanding of the relationship between read length and number of passes, and it is currently not possible to implement this relationship in PBSIM3.

If you understand the relationship between read length and number of passes, your method (repeat the simulation 30 times with different parameters) is a simple and good method.
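As a rough illustration of how such a repeated-run plan could be prepared, here is a minimal sketch. It is a variation on the "30 runs" idea rather than the method itself: instead of 30 equal runs, it groups reads by pass count and computes the depth and length statistics each group would need as a separate run. The function name `plan_runs`, the example pairs, and the genome size are all hypothetical; mapping each row onto actual PBSIM3 options should be checked against the PBSIM3 README.

```python
from collections import defaultdict
from statistics import mean, pstdev

def plan_runs(pairs, genome_size):
    """Group (fragment_length, passes) pairs by pass count.

    Returns one dict per distinct pass count, holding the depth that
    group contributes and the mean/sd of its fragment lengths.
    """
    groups = defaultdict(list)
    for frag_len, passes in pairs:
        groups[passes].append(frag_len)

    plan = []
    for passes in sorted(groups):
        lengths = groups[passes]
        plan.append({
            "passes": passes,
            # depth contributed by this group = total bases / genome size
            "depth": sum(lengths) / genome_size,
            "length_mean": mean(lengths),
            "length_sd": pstdev(lengths),
            "n_reads": len(lengths),
        })
    return plan

# Hypothetical input: four reads on a 1 Mb genome.
example_pairs = [(10_000, 20), (12_000, 20), (20_000, 10), (25_000, 8)]
for run in plan_runs(example_pairs, genome_size=1_000_000):
    print(run)
```

Each row would then correspond to one simulation run with a fixed number of passes. Groups containing very few reads yield very small depths, so this still discretizes the length/pass relationship.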

Sebastien-Raguideau · 2024-04-11T10:57:48Z

Thanks for your quick answer!

I would just give a distribution for the polymerase read length, let's say centered around 200k with some standard deviation (this can be learned or left as a user parameter).
Then for each read, sample a DNA fragment length, sample a polymerase read length, and deduce the expected number of passes by taking the ratio (polymerase read length divided by fragment length).
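The sampling described above could look roughly like the sketch below. Everything except the 200k center is an assumption made for illustration: the normal distributions, the standard deviations, the fragment length mean, and the function name `sample_fragment_passes` are hypothetical, and reads whose polymerase read is shorter than the fragment are simply resampled.

```python
import random

def sample_fragment_passes(n_reads, frag_mean=15_000, frag_sd=3_000,
                           poly_mean=200_000, poly_sd=50_000, seed=0):
    """Return a list of (fragment_length, n_passes) pairs.

    Both lengths are drawn from normal distributions (an assumption);
    the pass count is the integer ratio polymerase length / fragment length.
    """
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < n_reads:
        frag_len = round(rng.gauss(frag_mean, frag_sd))
        poly_len = round(rng.gauss(poly_mean, poly_sd))
        # Resample non-physical draws and reads without one full pass.
        if frag_len <= 0 or poly_len < frag_len:
            continue
        pairs.append((frag_len, poly_len // frag_len))
    return pairs

for frag_len, passes in sample_fragment_passes(5):
    print(frag_len, passes)
```

Because the polymerase read length is drawn independently of the fragment length, longer fragments end up with fewer passes on average, which is the pattern the request is about.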

I can easily generate a file listing all pairs (DNA fragment length, number of passes) for all reads, so as to obtain a set coverage. However, PBSIM3 would not be able to take that as input at the moment.

I am not too keen on the 30-runs method: it implies a weird discretization, and I do intend to simulate coverage going as low as 0.5 (metagenomic mix).

Sebastien-Raguideau changed the title ~~Improvement: relate number of passes to dna fragment size~~ Improvement for HiFi : relate number of passes to dna fragment size Apr 10, 2024
