About the usage of vg deconstruct #4356

Tonitsk8264 · 2024-07-26T09:35:19Z

Hi!

I used vg chunk to extract subgraphs from a pan-genome graph and then used vg deconstruct to obtain VCF files for both the pan-genome graph and the subgraph. However, I noticed that the VCF file for the subgraph contains variants that are not present in the pan-genome graph. Is this expected?

Thank you very much!

glennhickey · 2024-07-26T12:58:14Z

Yeah, this is expected and super annoying. The issue is that the snarl decomposition, which determines the sites in the VCF, is somewhat dependent on the "rooting" of the snarl tree. This rooting can be inconsistent between a graph and a subgraph when each is computed independently.

One work-around might be to run vg snarls <graph> -A cactus > <graph>.snarls on both graphs, then pass the output to deconstruct -r. This reverts to the older snarl logic, which I think had some heuristics to anchor snarls to reference paths. @adamnovak how feasible do you think it would be to add something like this back to the new snarl finder?
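The work-around above could be scripted roughly as follows. This is only a sketch: it builds the two command lines for one graph so the same recipe can be applied to both the full graph and the subgraph. The helper names (cactus_workaround_commands, run_to_file) and the ref_path argument are hypothetical, and the use of -p to name the reference path is an assumption that should be checked against vg deconstruct --help for your vg version.

```python
import subprocess


def cactus_workaround_commands(graph, ref_path):
    """Build the (command, output file) pairs for one graph.

    graph    : path to the graph file (hypothetical example input)
    ref_path : reference path name handed to deconstruct via -p (assumption)
    """
    snarls_file = graph + ".snarls"
    vcf_file = graph + ".vcf"
    # Step 1: compute snarls with the older cactus-based finder.
    snarls_cmd = ["vg", "snarls", graph, "-A", "cactus"]
    # Step 2: hand those snarls to deconstruct with -r.
    deconstruct_cmd = ["vg", "deconstruct", "-r", snarls_file, "-p", ref_path, graph]
    return [(snarls_cmd, snarls_file), (deconstruct_cmd, vcf_file)]


def run_to_file(cmd, out_path):
    """Run cmd and write its stdout to out_path, failing on a non-zero exit."""
    with open(out_path, "wb") as out:
        subprocess.run(cmd, stdout=out, check=True)
```

To apply it, loop over the pairs returned for each graph and call run_to_file(cmd, out_path) on each in order, so that the snarls file exists before deconstruct reads it.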

adamnovak · 2024-07-26T14:51:06Z

I think the main obstacle to doing something like that was UI; we'd need to come up with a way for the user to explain how they want their snarls rooted and then carry that all the way through, and we'd probably want it available in anything that might need to make snarls.

Now that we have a stronger notion of reference-sense paths, it might be possible to prefer those for anchoring by default?

Han-Cao · 2024-07-30T03:15:01Z

Hi,

I am wondering if it is possible to use vg snarls -A cactus (or any other way) to deconstruct consistent variants between haplotype-sampled graphs and the original graph. This would be very helpful for SV merging when using haplotype sampling in population studies.

glennhickey · 2024-07-30T14:19:47Z

The idea now is to use haplotype sampling only for mapping. You still call all samples on the full graph (with which the sampled GAMs are compatible).

Han-Cao · 2024-08-03T09:51:42Z

Hi @glennhickey ,

Thanks for the suggestion, I have tried calling variants on the full graph. I ran vg deconstruct -a and vg call -A -a -z with the same gbz and snarls file and then used vcfbub, decompose tool, and bcftools norm -f ref.fa -m - to generate decomposed biallelic VCFs for comparison.

When considering variants with INDEL length > 20 (i.e., bcftools view -i 'abs(ILEN)>20'):

The number of variants from vg call is ~20% of the number from vg deconstruct
~99.6% of the variants from vg call can be exactly matched with one variant in vg deconstruct by chr, pos, ref, alt
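A comparison like the one described above can be sketched in a few lines. This is not the script used in the thread; it is a minimal illustration that assumes both VCFs are already decomposed to biallelic records (as after bcftools norm -m -) and are available as uncompressed text lines. The function names are hypothetical.

```python
def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from VCF text lines, skipping headers."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        keys.add((chrom, int(pos), ref, alt))
    return keys


def match_summary(call_lines, deconstruct_lines):
    """Compare vg call records against vg deconstruct records by exact key."""
    call = variant_keys(call_lines)
    deconstruct = variant_keys(deconstruct_lines)
    matched = call & deconstruct
    return {
        "call": len(call),
        "deconstruct": len(deconstruct),
        "matched": len(matched),
        # Fraction of vg call variants with an exact match in vg deconstruct.
        "call_matched_fraction": len(matched) / len(call) if call else 0.0,
        # Variants seen only by vg deconstruct.
        "missing_from_call": deconstruct - call,
    }
```

Passing open file handles for the two filtered VCFs, for example match_summary(open("call.vcf"), open("deconstruct.vcf")) with hypothetical file names, gives the counts and the set of deconstruct-only variants.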

It is really great to see that most of the variants from vg call are consistent with vg deconstruct, but a lot of variants are missing. I read the previous discussion about the difference between vg call and vg deconstruct in #3888; is such a difference still expected? For the variants missing in vg call, is it OK to assign them as './.' or even '0/0'?
