Sample fragmented haplotypes #4523

jltsiren · 2025-02-12T02:58:12Z

Changelog Entry

To be copied to the draft changelog by merger:

Haplotype sampling can sample fragmented haplotypes in large snarls.

Description

Until now, haplotype sampling only considered haplotype paths that cross the entire block/subchain. HPRC v2 graphs have some large snarls that very few haplotypes cross in full. Most haplotypes in these snarls have been clipped into multiple fragments because of unaligned sequence intervals longer than 10 kbp.

With this PR, we can sample any minimal end-to-end haplotype visit to a subchain, regardless of the number of fragments. We consider haplotype sequences consisting of a chain of paths (haplotype fragments) that all share the same (sample name, haplotype number, sequence name). Fragments within a chain are ordered by starting offset / fragment number (the fourth field in GBWT metadata). Note that due to HPRC naming conventions, we cannot cross assembly gaps this way. Because path names use accession numbers rather than chromosome names as sequence names, we cannot tell whether two assembly contigs correspond to the same chromosome, or in which order they occur.
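
As a rough illustration of the chaining rule described above (not the actual vg/GBWT implementation), the grouping and ordering could be sketched in Python as follows. The `Fragment` type, the `chain_fragments` function, and the sample and contig names are hypothetical and exist only for this example.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Fragment(NamedTuple):
    """Hypothetical stand-in for one GBWT path name plus a path identifier."""
    sample: str      # sample name
    haplotype: int   # haplotype number
    contig: str      # sequence name (an accession number in HPRC graphs)
    fragment: int    # starting offset / fragment number (fourth field)
    path_id: int     # illustrative path identifier


ChainKey = Tuple[str, int, str]


def chain_fragments(fragments: List[Fragment]) -> Dict[ChainKey, List[Fragment]]:
    """Group fragments by (sample, haplotype, contig) and sort each group."""
    chains: Dict[ChainKey, List[Fragment]] = defaultdict(list)
    for frag in fragments:
        chains[(frag.sample, frag.haplotype, frag.contig)].append(frag)
    for chain in chains.values():
        # Order fragments within a chain by starting offset / fragment number.
        chain.sort(key=lambda f: f.fragment)
    return dict(chains)


# Made-up example data: two fragments of one haplotype on contigA,
# plus fragments that must not be chained with them.
fragments = [
    Fragment("sampleA", 1, "contigA", 20000, 7),
    Fragment("sampleA", 1, "contigA", 0, 3),
    Fragment("sampleA", 2, "contigB", 0, 4),
    Fragment("sampleA", 1, "contigC", 0, 9),
]
chains = chain_fragments(fragments)
```

In this sketch, the two `contigA` fragments end up in one chain ordered by offset, while the `contigC` fragment stays in its own chain even though it has the same sample and haplotype. That mirrors the limitation with assembly gaps: different sequence names are never joined.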

Haplotype information (.hapl) files must be rebuilt to include fragmented haplotypes. Old files remain usable, but new files with fragmented haplotypes cannot be used with earlier versions of vg.

Also fixes #4517. GBWTGraph GFA parsing now accepts missing SeqStart/SeqEnd fields in W-lines.
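
For context, a minimal Python sketch of tolerant W-line parsing, assuming that missing SeqStart/SeqEnd fields are written as `*` as the GFA 1.1 specification allows. The `WalkLine` type and `parse_w_line` function are hypothetical, and how GBWTGraph actually interprets missing values is not shown here.

```python
from typing import NamedTuple, Optional


class WalkLine(NamedTuple):
    """Hypothetical container for the fields of a GFA W-line."""
    sample: str
    haplotype: int
    seq_id: str
    seq_start: Optional[int]  # None when the field is "*"
    seq_end: Optional[int]    # None when the field is "*"
    walk: str


def parse_optional_int(field: str) -> Optional[int]:
    """Return None for a missing field ("*"), otherwise the integer value."""
    return None if field == "*" else int(field)


def parse_w_line(line: str) -> WalkLine:
    """Parse a tab-separated W-line, ignoring any trailing optional tags."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 7 or fields[0] != "W":
        raise ValueError("not a valid W-line")
    return WalkLine(
        sample=fields[1],
        haplotype=int(fields[2]),
        seq_id=fields[3],
        seq_start=parse_optional_int(fields[4]),
        seq_end=parse_optional_int(fields[5]),
        walk=fields[6],
    )


# Made-up example lines: one with missing coordinates, one with both present.
missing = parse_w_line("W\tsampleA\t1\tcontigA\t*\t*\t>1>2<3")
present = parse_w_line("W\tsampleA\t1\tcontigA\t0\t11\t>1>2<3")
```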

I'm not going to merge this immediately, as we still need to determine the impact on read mapping and variant calling.

jltsiren added 13 commits January 26, 2025 18:43

Fix GBWT starting position key in gaf_sorter

4ad0fe2

Remove obsolete developer options from vg haplotypes

8377f92

Update GBWTGraph (see #4517)

9ff0ea8

Update GBWT for FragmentMap

188f711

Sample fragmented haplotypes if the .hapl file has them

9f454bd

Sampling fragmented haplotypes kind of works

88bf4ee

Update GBWT for additional debug information

7871c19

Bug fixes

fa04860

Validate fragmented haplotypes

c066d77

Extract node ids in a subchain

e28e643

Do not sample fragments outside the subchain

610660f

Sample fragmented extra fragments

c5fe6b9

Merge branch 'master' of https://github.com/vgteam/vg

5cfcdb4
