computational efficiency of pggb #370

Open
yihangs opened this issue Jan 25, 2024 · 3 comments


yihangs commented Jan 25, 2024

Hi,

A recent paper, "Comparing methods for constructing and representing human pangenome graphs", reports that pggb cannot construct graphs from 104 human haplotypes because of low computational efficiency. This seems to contradict the results in the paper "A draft human pangenome reference", where pggb is used to construct graphs from around 90 haplotypes, a number very close to 104. I am therefore wondering about the computational efficiency of pggb: can it handle hundreds or even thousands of haplotypes? If not, what is the key bottleneck?

Thanks!

Collaborator

ekg commented Jan 25, 2024

It seems that the cited paper misunderstood how the variation graph building methods are currently used in the HPRC. PGGB (and minigraph-cactus) are run on each chromosome individually, which allows for high parallelism in graph building. Just throwing all data from all human chromosomes in the HPRC onto a single node is likely to take a very long time and produce a result that may be hard to understand. Improving the partitioning process is critical to enabling this kind of use. To minimize bias, we propose a community detection method that partitions the graph building process into pieces, each of which can be processed independently on a cluster. Refining this is the main area of ongoing work with PGGB, as it will lead to automatic and unbiased graph building in any context, not just those where there is a clear partitioning by chromosome (or, in humans, most chromosomes, the sex chromosomes, and the acrocentrics).
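To illustrate the partitioning idea, here is a minimal Python sketch. It is not pggb's implementation: the function name `partition_sequences` and the sequence names are hypothetical, and plain connected components over pairwise mappings stand in for a real community detection method. The input is assumed to be a list of (query, target) pairs, for example taken from an all-vs-all mapping step.

```python
from collections import defaultdict


def partition_sequences(mappings):
    """Group sequence names into partitions that share mappings.

    `mappings` is an iterable of (query, target) name pairs.
    Assumption: connected components (via union-find) are used here
    as a simple stand-in for community detection.
    """
    parent = {}

    def find(x):
        # Register unseen names as their own root, then walk to the root
        # with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for query, target in mappings:
        union(query, target)

    groups = defaultdict(list)
    for name in parent:
        groups[find(name)].append(name)

    # Sort for a deterministic result.
    return sorted(sorted(group) for group in groups.values())


# Hypothetical sequence names; each resulting group could be built
# into a graph independently, e.g. as a separate cluster job.
example = [
    ("sampleA#1#chr1", "sampleB#1#chr1"),
    ("sampleB#1#chr1", "sampleC#1#chr1"),
    ("sampleA#1#chr2", "sampleB#1#chr2"),
]
partitions = partition_sequences(example)
```

In this toy input the chr1 and chr2 sequences never map to each other, so they fall into two separate groups. Real data will contain mappings between chromosomes, which is one reason a community detection method may be preferable to plain connected components.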

Member

subwaystation commented Jan 25, 2024

Also, the pggb version used in the paper https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03098-2 was 0.2.0, which was released in Nov 2021. Since then, lots of performance updates have been integrated into pggb.
I have already been able to build a pangenome graph directly from all 90 haplotypes at once (not per chromosome) using https://nf-co.re/pangenome. This pipeline directly mirrors https://github.com/pangenome/pggb/blob/master/partition-before-pggb followed by https://github.com/pangenome/pggb/blob/master/pggb.
While I did not evaluate pggb 0.2.0, the current pggb tools are for sure up to the task(s) executed in the mentioned paper. Even 104 haplotypes would run smoothly.

Author

yihangs commented Feb 6, 2024

Thank you for the reply! I have another pggb-related question, posted here: ekg/seqwish#121. I am wondering if you have any ideas about it.

Thanks!
