Add normalizer that snaps together redundant path traversals through sites #4396
Changelog Entry
To be copied to the draft changelog by merger:
`vg paths -n` option added to normalize graphs using path information to "snap together" redundant paths through snarls. After running, no two path traversals through a snarl will ever produce the same sequence string without the traversals themselves being identical.

Description
As mentioned here, `AT` fields in `deconstruct`ed VCFs can be wrong in the sense that they do not reflect the actual path in the graph, but rather an equivalent path (one that produces the same DNA sequence) from some other haplotype.

This PR adds an option to explicitly check a graph for these cases and remove them. The logic is [...] `deconstruct`) and determine which, if any, produce the same string.

I was a bit surprised how little this ended up changing the graph in the end (which I guess means cactus/abpoa/gfaffix are doing a pretty good job already). On `hprc-mc-v1.1-grch38` the normalized graph has only [...]

So not much impact. But, while I don't have a log to count, the majority of paths snapped do not result in nodes/edges lost, so I think/hope the path representation is cleaned up more than these numbers indicate.
In any case, it's fast enough to run, and the fact that it guarantees correct `AT` fields in the VCF seems like a good enough reason to run it by default in minigraph-cactus...
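
For readers who want a concrete picture of the "same string, different traversal" check described above, here is a minimal Python sketch. It is not the vg implementation: the function names `spell` and `snap_traversals`, the dict-based graph representation, and the rule for choosing the canonical traversal (most common, ties broken by sort order) are all assumptions made for illustration.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def spell(steps, node_seqs):
    """Spell the sequence of a traversal given as (node_id, is_reverse) steps."""
    parts = []
    for node_id, is_reverse in steps:
        seq = node_seqs[node_id]
        if is_reverse:
            # Reverse complement for nodes visited on the reverse strand.
            seq = seq.translate(_COMPLEMENT)[::-1]
        parts.append(seq)
    return "".join(parts)


def snap_traversals(traversals, node_seqs):
    """Map every path to one canonical traversal per spelled sequence.

    Assumption (hypothetical rule): the canonical traversal is the most
    common one in its group, with ties broken by sort order.
    """
    groups = defaultdict(list)
    for name, steps in traversals.items():
        groups[spell(steps, node_seqs)].append((name, tuple(steps)))

    snapped = {}
    for members in groups.values():
        counts = defaultdict(int)
        for _, steps in members:
            counts[steps] += 1
        canonical = min(counts, key=lambda s: (-counts[s], s))
        for name, _ in members:
            snapped[name] = canonical
    return snapped


# Toy snarl: nodes 3+4 spell the same "CG" as node 2.
node_seqs = {1: "A", 2: "CG", 3: "C", 4: "G", 5: "T"}
traversals = {
    "hapA": [(1, False), (2, False), (5, False)],
    "hapB": [(1, False), (3, False), (4, False), (5, False)],
    "hapC": [(1, False), (2, False), (5, False)],
    "hapD": [(1, False), (3, False), (5, False)],
}
snapped = snap_traversals(traversals, node_seqs)
```

In this toy snarl, `hapB` spells the same sequence as `hapA` and `hapC` through different nodes, so it is snapped onto their traversal, while `hapD` spells a different sequence and is left alone. Afterwards, each distinct sequence corresponds to exactly one traversal, which is the invariant stated in the changelog entry.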