computational efficiency of pggb #370

yihangs · 2024-01-25T03:40:39Z

Hi,

A recent paper, "Comparing methods for constructing and representing human pangenome graphs", shows that pggb cannot construct graphs from 104 human haplotypes because of low computational efficiency. This seems to contradict the results in the paper "A draft human pangenome reference", where pggb is used to construct graphs from around 90 haplotypes, a number very close to 104. I am therefore wondering about the computational efficiency of pggb: can it handle hundreds or even thousands of haplotypes? If not, what would be the key bottleneck?

Thanks!

ekg · 2024-01-25T08:29:33Z

It seems that the cited paper had a misunderstanding about how the variation graph building methods are currently used in the HPRC. PGGB (and minigraph-cactus) are run on each chromosome individually. This allows for high parallelism in graph building. Just throwing all data from all human chromosomes in the HPRC into a single node is likely to take a very long time and produce a result which may be hard to understand. Improving the partitioning process is critical to enabling this kind of use. To minimize bias, we propose a community detection method to partition the graph building process into pieces that each can be processed independently on a cluster. Refining this is the main area of ongoing work with PGGB, as it will lead to automatic and unbiased graph building in any context, not just those where there is a clear partitioning by chromosome (or in humans, most chromosomes, the sex chromosomes, and the acrocentrics).
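
To illustrate the partitioning idea, here is a minimal Python sketch. It is not the method pggb uses: it swaps in plain connected components (via union-find) as a simplified stand-in for community detection, and it assumes the pairwise mappings between sequences have already been computed by some aligner. The function name `partition_sequences`, the sequence names, and the mapping pairs are all hypothetical.

```python
from collections import defaultdict


def partition_sequences(names, mappings):
    """Group sequence names into connected components of the mapping graph.

    names    -- iterable of sequence names (hypothetical examples below)
    mappings -- iterable of (name_a, name_b) pairs, one per pairwise mapping
    Assumption: connected components stand in for real community detection.
    """
    # Each sequence starts in its own set.
    parent = {name: name for name in names}

    def find(x):
        # Walk to the set representative, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the sets of any two sequences that map to each other.
    for a, b in mappings:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect members per representative; sort for a deterministic result.
    groups = defaultdict(list)
    for name in names:
        groups[find(name)].append(name)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    # Hypothetical sequence names and mappings, for illustration only.
    names = ["refA#chr1", "refA#chr2", "asmB#ctg1", "asmB#ctg2", "asmC#ctg9"]
    mappings = [
        ("refA#chr1", "asmB#ctg1"),
        ("refA#chr2", "asmB#ctg2"),
        ("asmB#ctg1", "asmC#ctg9"),
    ]
    for i, group in enumerate(partition_sequences(names, mappings)):
        print(f"partition {i}: {group}")
```

Each resulting group could then be handed to a separate graph-building job on a cluster. A real community detection method would additionally split groups that are joined only by a few spurious mappings, which connected components cannot do.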

subwaystation · 2024-01-25T08:50:46Z

Also, the pggb version used in the paper https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03098-2 was 0.2.0, which was released in Nov 2021. Since then, lots of performance updates have been integrated into pggb.
I myself was already able to build a pangenome graph directly from all 90 haplotypes at once (not per chromosome) using https://nf-co.re/pangenome. This pipeline directly mirrors https://github.com/pangenome/pggb/blob/master/partition-before-pggb followed by https://github.com/pangenome/pggb/blob/master/pggb.
While I did not evaluate pggb 0.2.0, the current pggb tools are certainly up to the task(s) executed in the mentioned paper. Even 104 haplotypes would run smoothly.

yihangs · 2024-02-06T21:27:45Z

Thank you for the reply! I have another pggb-related question, posted here: ekg/seqwish#121. I am wondering if you have any thoughts on that.

Thanks!
