Sample fragmented haplotypes #4523

Open: wants to merge 13 commits into master


jltsiren
Contributor

Changelog Entry

To be copied to the draft changelog by merger:

  • Haplotype sampling can sample fragmented haplotypes in large snarls.

Description

Until now, haplotype sampling only considered haplotype paths that cross the entire block/subchain. HPRC v2 graphs have some large snarls where very few haplotypes do that. Most haplotypes in those snarls have been clipped into multiple fragments because of unaligned sequence intervals longer than 10 kbp.

With this PR, we can sample any minimal end-to-end haplotype visit to a subchain, regardless of the number of fragments. We consider haplotype sequences consisting of a chain of paths / haplotype fragments that all share the same (sample name, haplotype number, sequence name). Fragments within a chain are ordered by starting offset / fragment number (the fourth field in GBWT metadata). Note that due to HPRC naming conventions, we cannot cross assembly gaps this way: because path names use accession numbers rather than chromosome names as sequence names, we cannot tell whether two assembly contigs belong to the same chromosome, or in which order they occur.
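
As a rough illustration of the fragment chaining described above, here is a minimal Python sketch. It is not the code in this PR: `PathName` and `chain_fragments` are hypothetical names, and the sample and contig names are made up. It only shows the grouping by (sample name, haplotype number, sequence name) and the ordering by the fourth metadata field.

```python
from collections import defaultdict
from typing import NamedTuple


class PathName(NamedTuple):
    """Hypothetical stand-in for the four path name fields in GBWT metadata."""
    sample: str
    haplotype: int
    contig: str    # sequence name (an accession number in HPRC graphs)
    fragment: int  # starting offset or fragment number


def chain_fragments(paths):
    """Group fragments by (sample, haplotype, contig) and sort each group by the fourth field."""
    chains = defaultdict(list)
    for path in paths:
        chains[(path.sample, path.haplotype, path.contig)].append(path)
    return {
        key: sorted(group, key=lambda p: p.fragment)
        for key, group in chains.items()
    }


# Made-up example: two fragments of one contig, plus two unrelated paths.
paths = [
    PathName("HG002", 1, "contig_A", 25000),
    PathName("HG002", 1, "contig_A", 0),
    PathName("HG002", 2, "contig_B", 0),
    PathName("HG002", 1, "contig_C", 0),
]
chains = chain_fragments(paths)
```

In this sketch, `contig_A` and `contig_C` end up in separate chains even though they share a sample and haplotype number. That mirrors the limitation noted above: nothing in the names says whether the two contigs come from the same chromosome.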

Haplotype information (.hapl) files must be rebuilt to include fragmented haplotypes. Old files remain usable, but new files with fragmented haplotypes cannot be used with earlier versions of vg.

Also fixes #4517. GBWTGraph GFA parsing now accepts missing SeqStart/SeqEnd fields in W-lines.
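
For context, a GFA W-line has the tab-separated fields `W`, SampleId, HapIndex, SeqId, SeqStart, SeqEnd, Walk, and the two coordinate fields may be `*` when unknown. The following Python sketch shows one way a parser can tolerate that. It is an illustration only, not the GBWTGraph parser: `parse_w_line` is a hypothetical helper, and mapping `*` to `None` is an assumption made for the example.

```python
def parse_w_line(line):
    """Parse a GFA W-line, treating '*' in SeqStart/SeqEnd as a missing value."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 7 or fields[0] != "W":
        raise ValueError("not a W-line")

    def optional_int(value):
        # '*' marks a missing coordinate; represent it as None in this sketch.
        return None if value == "*" else int(value)

    return {
        "sample": fields[1],
        "haplotype": int(fields[2]),
        "contig": fields[3],
        "start": optional_int(fields[4]),
        "end": optional_int(fields[5]),
        "walk": fields[6],
    }


# Made-up W-lines: one without coordinates, one with them.
missing = parse_w_line("W\tsample\t1\tchr1\t*\t*\t>1>2<3")
present = parse_w_line("W\tsample\t1\tchr1\t0\t100\t>1>2<3")
```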

I'm not going to merge this immediately, as we still need to determine the impact on read mapping and variant calling.

Development

Successfully merging this pull request may close these issues.

VG autoindex does not work for gfa walks when the SeqStart or SeqEnd fields are equal to "*"