Optimise feature matrix management to accelerate L2G Training and Prediction #3252

Closed · 2 tasks
ireneisdoomed opened this issue Mar 11, 2024 · 3 comments · Fixed by opentargets/gentropy#745

ireneisdoomed commented Mar 11, 2024

As a developer, I want to optimise feature annotation processing because it will reduce computation time during the L2G training and prediction phases.

Background

L2G currently annotates all features of the input credible sets at execution time. We designed it this way because:

  • the feature matrix is purely an intermediate dataset, only useful in the process of L2G training/prediction;
  • it is more reliable: we don't introduce a codependency between two files.

However sensible, in practice this approach makes training L2G under different scenarios inconvenient. Most of the step's computation time goes into feature annotation, so every single L2G training run, in which we annotate all credible sets, takes about 25 minutes.
This also affects prediction. In that step, only the credible sets for which we want to extract L2G scores are annotated, yet I experienced unreasonably long times extracting predictions for just 30 loci.

Tasks

  • Ensure the business logic of the colocalisation factories doesn't have big bottlenecks that might be slowing the process down.
  • If not, consider writing the feature matrix as another dataset. To ensure credible set/feature matrix compatibility, we must assert that all credible sets are part of the feature matrix (see the sketch after this list).
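
A minimal sketch of that assertion, assuming hypothetical PySpark DataFrames `credible_sets` and `feature_matrix` that both carry a `studyLocusId` column (the function and column names are assumptions, not the gentropy API):

```python
from pyspark.sql import DataFrame


def assert_feature_matrix_covers_credible_sets(
    credible_sets: DataFrame, feature_matrix: DataFrame
) -> None:
    """Fail fast if any credible set is missing from the feature matrix."""
    # The anti-join keeps credible sets with no matching row in the feature matrix.
    missing = credible_sets.select("studyLocusId").distinct().join(
        feature_matrix.select("studyLocusId").distinct(),
        on="studyLocusId",
        how="left_anti",
    )
    n_missing = missing.count()
    if n_missing > 0:
        raise ValueError(
            f"{n_missing} credible sets are missing from the feature matrix; "
            "regenerate the feature matrix before training/prediction."
        )
```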
@ireneisdoomed (Author) commented:

This PR is relevant to the issue described here: opentargets/gentropy#544

After my tests, I concluded that the majority of the computation time goes into feature extraction (the generation of the long dataframe). There is a lot of logic there, so any improvement will speed up the process.
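
As a generic illustration of one way to attack that bottleneck (not the gentropy implementation; `extract_features` and all column names here are hypothetical), persisting the long dataframe before pivoting it into the wide feature matrix stops Spark from recomputing the expensive extraction for every downstream action:

```python
from pyspark import StorageLevel
from pyspark.sql import functions as f

# Hypothetical expensive step: one row per (credible set, gene, feature).
long_features = extract_features(credible_sets)

# Cache the long dataframe so later actions reuse it instead of re-running
# the whole feature extraction lineage.
long_features.persist(StorageLevel.MEMORY_AND_DISK)

# Pivot into the wide matrix consumed by the L2G model:
# one row per (credible set, gene), one column per feature.
feature_matrix = (
    long_features.groupBy("studyLocusId", "geneId")
    .pivot("featureName")
    .agg(f.first("featureValue"))
)
```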

@addramir commented:

Can we close this issue since it is duplicated in other issues?

@ireneisdoomed (Author) commented:

Yes. Closing, as there are no specific actions.
