About the usage of vg deconstruct #4356

Open
Tonitsk8264 opened this issue Jul 26, 2024 · 5 comments

@Tonitsk8264

Hi!

I used vg chunk to extract subgraphs from a pan-genome graph and then used vg deconstruct to obtain VCF files for both the pan-genome graph and the subgraph. However, I noticed that the VCF file for the subgraph contains variants that are not present in the pan-genome graph. Is this expected?

Thank you very much!

[Image: WeCom screenshot 1721986245558]

[Image: WeCom screenshot 17219861274506]

@glennhickey
Contributor

Yeah, this is expected and super annoying. The issue is that the snarl decomposition, which determines the sites in the VCF, is somewhat dependent on the "rooting" of the snarl tree. And this rooting can be inconsistent between a graph and a subgraph when the two are computed independently.

One work-around might be to use vg snarls <graph> -A cactus > <graph>.snarls on both graphs, then pass the output to deconstruct -r. This reverts back to the older snarl logic which I think had some heuristics to anchor snarls to reference paths. @adamnovak how feasible do you think it would be to add something like this back to the new snarl finder?
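A minimal sketch of that work-around, written as a small Python helper that only builds the two command lines so they can be inspected before running. The `vg snarls <graph> -A cactus` and `vg deconstruct -r` parts come from the comment above; the file names, the output naming scheme, and the `-P GRCh38` reference-path arguments are assumptions for illustration and should be checked against `vg deconstruct --help` for your vg version.

```python
import shlex


def snarl_workaround_commands(graph, extra_deconstruct_args=()):
    """Build the two shell commands of the cactus-snarl work-around for one graph.

    The output names (<graph>.snarls, <graph>.vcf) are a hypothetical convention.
    """
    q = shlex.quote
    snarls = f"{graph}.snarls"
    vcf = f"{graph}.vcf"

    # Step 1: compute snarls with the older cactus-based algorithm.
    find_snarls = f"vg snarls {q(graph)} -A cactus > {q(snarls)}"

    # Step 2: pass the precomputed snarls to deconstruct with -r.
    # Any further options (e.g. reference path selection) are supplied by the caller.
    parts = ["vg", "deconstruct", "-r", q(snarls)]
    parts += [q(arg) for arg in extra_deconstruct_args]
    parts.append(q(graph))
    deconstruct = " ".join(parts) + " > " + q(vcf)

    return [find_snarls, deconstruct]


# Hypothetical file names; apply the same procedure to both graphs.
for g in ("pangenome.vg", "chunk.vg"):
    for cmd in snarl_workaround_commands(g, ["-P", "GRCh38"]):
        print(cmd)
```

Running the same two steps on the full graph and on the subgraph is the point of the suggestion; whether the resulting sites then agree still has to be checked on real data.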

@adamnovak
Member

I think the main obstacle to doing something like that was UI; we'd need to come up with a way for the user to explain how they want their snarls rooted and then bring it all the way through, and we'd probably want to have it in anything that might need to compute snarls.

Now that we have a stronger notion of reference-sense paths, it might be possible to prefer those for anchoring by default?

@Han-Cao

Han-Cao commented Jul 30, 2024

Hi,

I am wondering if it is possible to use vg snarls -A cactus (or any other way) to deconstruct consistent variants between haplotype-sampled graphs and the original graph. This would be very helpful for SV merging when using haplotype sampling in population studies.

@glennhickey
Contributor

The idea now is to use haplotype sampling only for mapping. You still call all samples on the full graph (with which the sampled GAMs are compatible).

@Han-Cao

Han-Cao commented Aug 3, 2024

Hi @glennhickey,

Thanks for the suggestion. I have tried calling variants on the full graph. I ran vg deconstruct -a and vg call -A -a -z with the same gbz and snarls file, and then used vcfbub, decompose tool, and bcftools norm -f ref.fa -m - to generate decomposed biallelic VCFs for comparison.

When considering variants with INDEL length > 20 (i.e., bcftools view -i 'abs(ILEN)>20'):

  • vg call reports only ~20% as many variants as vg deconstruct
  • ~99.6% of the variants from vg call can be exactly matched with one variant in vg deconstruct by chr, pos, ref, alt
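For anyone wanting to reproduce this kind of comparison, here is a rough sketch of the matching step in plain Python. It assumes both VCFs have already been decomposed and normalized to biallelic records with explicit (non-symbolic) alleles, and it approximates the `abs(ILEN)>20` filter as the difference between ALT and REF lengths. The function names and the toy records are made up for illustration and are not the data from this thread.

```python
def sv_keys(vcf_text, min_len=20):
    """Collect (chrom, pos, ref, alt) keys for records whose indel length exceeds min_len."""
    keys = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        if "," in alt:
            continue  # assumption: input is already biallelic, so skip anything that is not
        if abs(len(alt) - len(ref)) > min_len:
            keys.add((chrom, int(pos), ref, alt))
    return keys


def compare(call_vcf, deconstruct_vcf, min_len=20):
    """Summarize how the vg call set relates to the vg deconstruct set."""
    call = sv_keys(call_vcf, min_len)
    decon = sv_keys(deconstruct_vcf, min_len)
    return {
        "count_ratio": len(call) / len(decon) if decon else float("nan"),
        "matched_fraction": len(call & decon) / len(call) if call else float("nan"),
        "missing_from_call": sorted(decon - call),
    }


def rec(chrom, pos, ref, alt):
    """Build one minimal tab-separated VCF record (toy data only)."""
    return "\t".join([chrom, str(pos), ".", ref, alt, ".", ".", "."])


# Toy inputs: three large indels in the deconstruct set, one of them also in the call set.
decon_text = "\n".join([
    "##fileformat=VCFv4.2",
    rec("chr1", 100, "A", "A" + "C" * 30),
    rec("chr1", 500, "G" + "T" * 40, "G"),
    rec("chr1", 900, "C", "CT"),
    rec("chr2", 50, "T", "T" + "G" * 25),
])
call_text = "\n".join([
    "##fileformat=VCFv4.2",
    rec("chr1", 100, "A", "A" + "C" * 30),
    rec("chr1", 900, "C", "CT"),
])

summary = compare(call_text, decon_text)
print(summary["count_ratio"], summary["matched_fraction"], len(summary["missing_from_call"]))
```

The `missing_from_call` list is the set the question below is about: sites present in the deconstruct output with no exact counterpart in the vg call output.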

It is really great to see that most of the variants from vg call are consistent with vg deconstruct, but a lot of variants are missing. I read the previous discussion about the difference between vg call and vg deconstruct in #3888. Is such a difference still expected? For the variants missing from vg call, is it OK to assign them as './.' or even '0/0'?
