Changelog Entry
To be copied to the draft changelog by merger:
Description
Until now, haplotype sampling only considered haplotype paths that cross the entire block/subchain. HPRC v2 graphs have some large snarls that very few haplotypes cross in a single path. Most haplotypes there have been clipped into multiple fragments because of unaligned sequence intervals longer than 10 kbp.
With this PR, we can sample any minimal end-to-end haplotype visit to a subchain, regardless of the number of fragments. We consider haplotype sequences consisting of a chain of paths / haplotype fragments, all sharing the same (sample name, haplotype number, sequence name). Fragments within a chain are ordered by starting offset / fragment number (fourth field in GBWT metadata). Note that due to HPRC naming conventions, we cannot cross assembly gaps this way. Because path names use accession numbers rather than chromosome names as sequence names, we cannot tell whether two assembly contigs correspond to the same chromosome, or in which order they occur.
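As a rough illustration of the chaining rule described above (not the actual vg code), fragments can be grouped by (sample name, haplotype number, sequence name) and then sorted by the fourth metadata field. The `Fragment` type, its field names, and `chain_fragments` are hypothetical names made up for this sketch.

```python
from collections import defaultdict
from typing import NamedTuple


class Fragment(NamedTuple):
    """Hypothetical stand-in for the four GBWT path name fields plus a path id."""
    sample: str
    haplotype: int
    contig: str
    fragment: int  # starting offset / fragment number (fourth field)
    path_id: int


def chain_fragments(fragments):
    """Group fragments into chains and order each chain by the fragment field.

    Returns a dict mapping (sample, haplotype, contig) to a list of path ids.
    Fragments with different sequence names never end up in the same chain,
    which is why assembly gaps cannot be crossed this way.
    """
    chains = defaultdict(list)
    for f in fragments:
        chains[(f.sample, f.haplotype, f.contig)].append(f)
    return {
        key: [f.path_id for f in sorted(group, key=lambda f: f.fragment)]
        for key, group in chains.items()
    }
```

With this grouping, two fragments of the same contig at offsets 0 and 20000 form one chain in offset order, while a fragment from another contig of the same sample and haplotype stays in a separate chain.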
Haplotype information (`.hapl`) files must be rebuilt to include fragmented haplotypes. Old files remain usable, but new files with fragmented haplotypes cannot be used with earlier versions of vg.

Also fixes #4517. GBWTGraph GFA parsing now accepts missing SeqStart/SeqEnd fields in W-lines.
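For context on the W-line fix, here is a minimal sketch of tolerant W-line parsing. It assumes the GFA convention that a missing SeqStart or SeqEnd is written as `*`, and it is not the GBWTGraph parser. `WLine`, `parse_optional_int`, and `parse_w_line` are hypothetical names, and how GBWTGraph represents a missing value internally is not covered here.

```python
from typing import NamedTuple, Optional


class WLine(NamedTuple):
    sample: str
    haplotype: int
    contig: str
    start: Optional[int]  # None when SeqStart is missing
    end: Optional[int]    # None when SeqEnd is missing
    walk: str


def parse_optional_int(field: str) -> Optional[int]:
    """Assumption: '*' marks a missing optional integer field."""
    return None if field == "*" else int(field)


def parse_w_line(line: str) -> WLine:
    """Parse one tab-separated GFA W-line, tolerating missing SeqStart/SeqEnd."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 7 or fields[0] != "W":
        raise ValueError("not a W-line")
    return WLine(
        sample=fields[1],
        haplotype=int(fields[2]),
        contig=fields[3],
        start=parse_optional_int(fields[4]),
        end=parse_optional_int(fields[5]),
        walk=fields[6],
    )
```

A line with `*` in both positions then parses with `start` and `end` set to `None` instead of raising an error.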
I'm not going to merge this immediately, as we still need to determine the impact on read mapping and variant calling.