Drop V2G step #3434

Open
3 tasks
ireneisdoomed opened this issue Aug 29, 2024 · 2 comments · May be fixed by opentargets/gentropy#771
Labels
Genetics (Relates to Open Targets genetics team), Gentropy (Relates to the genetics ETL)

Comments


ireneisdoomed commented Aug 29, 2024

As a developer, I want to delete the step that generates the variant-to-gene dataset because it is an auxiliary concept that is only useful for generating features for L2G.

Background

Discussed during the Gentropy meeting 29/08.

Variant-to-gene (V2G) evidence is understood as any piece of evidence that supports the association of a variant with a gene. The current V2G sources are:

  • Distance of a variant to the gene TSS
  • Severity score of the variant's consequence as predicted by VEP
  • Flag indicating if the variant is predicted to be a loss-of-function variant by the LOFTEE algorithm
  • Linkage between genomic regions and genes based on genome interaction studies

All of this evidence is scored so that a higher value means a more confident link to the gene.
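As an illustration of the "higher means more confident" convention, here is a minimal sketch of a distance-based score. The function name, the linear scaling, and the 500 kb window are assumptions made for the example; they are not taken from the Gentropy code.

```python
from typing import Optional

# Hypothetical window size; the value used in Gentropy is not stated in this issue.
MAX_DISTANCE = 500_000


def distance_score(
    variant_pos: int, tss_pos: int, max_distance: int = MAX_DISTANCE
) -> Optional[float]:
    """Return a score in [0, 1] that grows as the variant gets closer to the TSS.

    Returns None when the variant lies outside the window, meaning no
    distance evidence is produced for that variant/gene pair.
    """
    distance = abs(variant_pos - tss_pos)
    if distance > max_distance:
        return None
    # Linear scaling (an assumption): 1.0 at the TSS, 0.0 at the window edge.
    return 1.0 - distance / max_distance
```

Any monotonically decreasing function of distance would satisfy the same convention; the linear form is used only because it is the simplest to read.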

All of these features except the interval data are variant annotations extracted from VEP, which we already have in the variant index. This, together with the fact that V2G as a concept is only useful as a temporary dataset for annotating credible sets in L2G, makes us want to remove the generation of this dataset.

Tasks

  • Remove the variant_to_gene step from Gentropy - update docs
  • Move V2G extraction into the L2G feature factories so that these relationships are generated and used during runtime only
  • Indirect task: because we have seen performance issues in this step, we want to make sure that moving the V2G logic doesn't degrade L2G performance. To that end, we want to explore sorting the variant index by chromosome and position as an optimisation
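To make the last two tasks more concrete, here is a rough sketch of why a variant index sorted by chromosome and position helps when variant/gene relationships are computed at runtime: each TSS window becomes a binary search instead of a full scan. It is written in plain Python for illustration only. The function names, the tuple layout, and the variant IDs are hypothetical, and the real implementation would operate on Spark DataFrames inside the L2G feature factories.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_sorted_index(variants):
    """Group (variant_id, chromosome, position) tuples by chromosome, sorted by position."""
    by_chrom = defaultdict(list)
    for variant_id, chrom, pos in variants:
        by_chrom[chrom].append((pos, variant_id))
    index = {}
    for chrom, entries in by_chrom.items():
        entries.sort()
        # Keep positions and ids in parallel lists so bisect can search positions directly.
        index[chrom] = ([p for p, _ in entries], [v for _, v in entries])
    return index


def variants_near_tss(index, chrom, tss, window):
    """Return (variant_id, distance) pairs for variants within `window` bp of the TSS."""
    positions, ids = index.get(chrom, ([], []))
    lo = bisect_left(positions, tss - window)   # first position >= tss - window
    hi = bisect_right(positions, tss + window)  # one past the last position <= tss + window
    return [(ids[i], abs(positions[i] - tss)) for i in range(lo, hi)]


# Toy data with made-up variant IDs.
variants = [
    ("1_300_A_T", "1", 300),
    ("1_100_G_C", "1", 100),
    ("2_150_T_G", "2", 150),
    ("1_900_C_A", "1", 900),
]
index = build_sorted_index(variants)
nearby = variants_near_tss(index, "1", tss=200, window=150)
```

With the index sorted once up front, each lookup costs a binary search plus the size of the result, which is the kind of saving the sorting optimisation is meant to explore.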

I think I'd still keep the variant_to_gene data model because it is useful as a concept and for testing purposes. But we may decide during the refactoring that it is not actually needed.

@ireneisdoomed ireneisdoomed added the Gentropy Relates to the genetics ETL label Aug 29, 2024
@project-defiant project-defiant added the Genetics Relates to Open Targets genetics team label Sep 12, 2024

project-defiant commented Sep 12, 2024

@xyg123 @addramir FYI, most likely we could scope this issue directly

@addramir

Can we close this issue as it duplicates #3258?


3 participants