Improvement for HiFi : relate number of passes to dna fragment size #25

Open
Sebastien-Raguideau opened this issue Apr 10, 2024 · 2 comments

@Sebastien-Raguideau

Hello,

Thanks for your software; it is extremely useful.

I suppose this is just a shameless feature request.

I have a slight misgiving about the number of passes and the fragment length not being related during data generation.

It is my understanding that polymerase read length follows a distribution that is unrelated to DNA fragment length. In effect, this means that longer DNA fragments cannot go through as many passes as shorter ones, so quality varies considerably with read length (high for short reads, low for long ones). That is why HiFi library preparation tends to cap the maximum DNA fragment size, so that a minimum number of passes is assured.
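To make the arithmetic concrete, here is a minimal sketch of the ratio described above. It assumes a fixed polymerase read length of 200 kb (an illustrative value) and ignores adapter sequence; the function name `expected_passes` is hypothetical and not part of PBSIM3.

```python
def expected_passes(polymerase_read_len: int, fragment_len: int) -> float:
    """Approximate number of passes as polymerase read length / fragment length.

    Assumption: adapters and partial first/last passes are ignored.
    """
    if fragment_len <= 0:
        raise ValueError("fragment_len must be positive")
    return polymerase_read_len / fragment_len


if __name__ == "__main__":
    # Same polymerase read length, increasingly long DNA fragments.
    for fragment_len in (5_000, 10_000, 20_000, 40_000):
        print(fragment_len, expected_passes(200_000, fragment_len))
```

With these illustrative numbers, a 5 kb fragment gets about 40 passes while a 40 kb fragment gets about 5, which is the length-dependent quality pattern in question.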

I am unsure how to use your tool to reproduce a similar pattern. I suppose I could do something silly: when I want a coverage of 30, run pbsim 30 times, each time asking for a coverage of 1 with different parameters for number of passes and fragment length. That seems hacky and not quite correct.

Would you be able to add this feature?

Best

@Sebastien-Raguideau Sebastien-Raguideau changed the title from "Improvement: relate number of passes to dna fragment size" to "Improvement for HiFi : relate number of passes to dna fragment size" on Apr 10, 2024
@yukiteruono
Owner

Thank you for your very interesting suggestion.

As you say, the longer the DNA fragment, the fewer the passes, so the longer the read, the lower the quality. Even in PBSIM3 simulations, changing the number of passes changes the quality of HiFi reads, as shown in Table S8 of the PBSIM3 paper. However, we do not have an accurate understanding of the relationship between read length and number of passes, so it is currently not possible to implement this relationship in PBSIM3.

If you understand the relationship between read length and number of passes, your method (repeating the simulation 30 times with different parameters) is a simple and good one.
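As a rough sketch of how such a multi-run plan could be laid out, the snippet below splits a target depth across a few fragment-length bins and derives a pass number for each bin from an assumed polymerase read length. The function `split_depth`, the bin weights, and the dictionary keys are all hypothetical; the keys only loosely mirror the settings discussed here (mean length, number of passes, depth) and are not PBSIM3 option names.

```python
def split_depth(total_depth, bins, polymerase_read_len=200_000):
    """Split total_depth across (fragment_len, weight) bins.

    Assumption: passes per bin = polymerase_read_len // fragment_len, at least 1.
    Returns one parameter set per simulation run.
    """
    total_weight = sum(weight for _, weight in bins)
    if total_weight <= 0:
        raise ValueError("bin weights must sum to a positive value")
    plan = []
    for fragment_len, weight in bins:
        plan.append({
            "length_mean": fragment_len,
            "pass_num": max(1, polymerase_read_len // fragment_len),
            "depth": total_depth * weight / total_weight,
        })
    return plan


if __name__ == "__main__":
    # Three equally weighted bins sharing a total depth of 30.
    for run in split_depth(30, [(10_000, 1), (20_000, 1), (40_000, 1)]):
        print(run)
```

Each returned entry would correspond to one separate simulation run, and the outputs would then be pooled.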

@Sebastien-Raguideau
Author

Sebastien-Raguideau commented Apr 11, 2024

Thanks for your quick answer!

I would just give a distribution for the polymerase read length, let's say centered around 200 kb plus some standard deviation (this could be learned or left as a user parameter).
Then, for each read, sample a DNA fragment length, sample a polymerase read length, and deduce the expected number of passes by taking the ratio.
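A minimal sketch of this per-read sampling, under stated assumptions: both lengths are drawn from normal distributions (real length distributions are likely skewed), the default means and standard deviations are illustrative only, and the function name `sample_read` is hypothetical.

```python
import random


def sample_read(rng, frag_mean=15_000, frag_sd=3_000,
                poly_mean=200_000, poly_sd=50_000, min_len=100):
    """Sample one (fragment_length, passes) pair.

    Assumption: normal distributions, clamped to min_len; passes is the
    integer ratio polymerase length // fragment length, at least 1.
    """
    fragment_len = max(min_len, round(rng.gauss(frag_mean, frag_sd)))
    polymerase_len = max(min_len, round(rng.gauss(poly_mean, poly_sd)))
    passes = max(1, polymerase_len // fragment_len)
    return fragment_len, passes


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed for reproducibility
    for _ in range(5):
        print(sample_read(rng))
```

Because the two draws are independent, longer sampled fragments tend to receive fewer passes without any explicit coupling parameter.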

I can easily generate a file listing all pairs (DNA fragment length, number of passes) for all reads, so as to reach a set coverage, though pbsim3 would not be able to take that as input at the moment.
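For illustration, such a file could be produced along these lines. This is only a sketch: the tab-separated layout, the column names, and the helpers `pairs_for_coverage`, `write_pairs`, and `toy_sampler` are all hypothetical, and as noted above PBSIM3 does not currently read this kind of input.

```python
import csv
import io
import random


def pairs_for_coverage(genome_len, coverage, sampler, rng):
    """Draw (fragment_length, passes) pairs until total bases reach the target.

    coverage may be fractional (e.g. 0.5).
    """
    target = genome_len * coverage
    total = 0
    pairs = []
    while total < target:
        fragment_len, passes = sampler(rng)
        pairs.append((fragment_len, passes))
        total += fragment_len
    return pairs


def write_pairs(pairs, handle):
    """Write pairs as a tab-separated table with a header row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["fragment_length", "passes"])
    writer.writerows(pairs)


def toy_sampler(rng):
    """Placeholder sampler: uniform fragment length, fixed 200 kb polymerase read."""
    fragment_len = rng.randint(5_000, 25_000)
    return fragment_len, max(1, 200_000 // fragment_len)


if __name__ == "__main__":
    pairs = pairs_for_coverage(1_000_000, 0.5, toy_sampler, random.Random(1))
    buffer = io.StringIO()
    write_pairs(pairs, buffer)
    print(buffer.getvalue()[:200])
```

Since reads are accumulated one at a time until the base count reaches the target, the same loop works for a coverage of 30 or of 0.5.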

I am not too keen on the 30-runs method: it implies a weird discretization, and I intend to simulate coverage as low as 0.5 (metagenomic mix).
