refactor(L2GFeatureMatrix)!: streamline feature matrix management #745

ireneisdoomed · 2024-09-03T12:28:41Z

✨ Context

Rewrite of all the logic to generate features. Not all previous features are available (more below); I'd like to know whether this POC is good enough to keep working on.

L2GFeatureMatrix.from_features_list will be the main entry point. It reads a list of feature names from the config and instantiates the FeatureFactory. FeatureFactory.generate_features is responsible for iterating over this feature list.
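To make the flow described above concrete, here is a minimal sketch in plain Python (no Spark). It is not the code in this PR: the registry, the feature names `exampleFeatureA`/`exampleFeatureB`, and the row shapes are assumptions made only for illustration.

```python
from __future__ import annotations

from typing import Callable

Row = dict
FeatureFn = Callable[[list], dict]

# Hypothetical registry: feature name as written in the config -> function
# returning {studyLocusId: value}. Names and values are placeholders.
FEATURE_REGISTRY: dict[str, FeatureFn] = {
    "exampleFeatureA": lambda loci: {row["studyLocusId"]: 1.0 for row in loci},
    "exampleFeatureB": lambda loci: {row["studyLocusId"]: 0.5 for row in loci},
}


class FeatureFactory:
    """Iterates over the requested feature names and computes each one."""

    def __init__(self, study_loci_to_annotate: list[Row], features_list: list[str]) -> None:
        self.study_loci_to_annotate = study_loci_to_annotate
        self.features_list = features_list

    def generate_features(self) -> dict[str, dict]:
        # Fail early on names that are not registered.
        unknown = [name for name in self.features_list if name not in FEATURE_REGISTRY]
        if unknown:
            raise ValueError(f"Unknown features: {unknown}")
        return {
            name: FEATURE_REGISTRY[name](self.study_loci_to_annotate)
            for name in self.features_list
        }


def from_features_list(study_loci: list[Row], features_list: list[str]) -> list[Row]:
    """Entry point: one output row per study locus, one column per feature."""
    computed = FeatureFactory(study_loci, features_list).generate_features()
    return [
        {
            "studyLocusId": row["studyLocusId"],
            **{name: values[row["studyLocusId"]] for name, values in computed.items()},
        }
        for row in study_loci
    ]


feature_matrix = from_features_list(
    [{"studyLocusId": 1}, {"studyLocusId": 2}],
    ["exampleFeatureA", "exampleFeatureB"],
)
```

The point of the sketch is only the division of labour: the entry point owns the list of names, and the factory owns the iteration.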

A schema of the new design is here:

https://excalidraw.com/#json=FSOwgU37GReVJuRGCjE6f,LAfIi3o_VJ635XCHNBmG7g

🛠 What does this PR implement

Feature Factory changes:

L2GFeatureMatrix is no longer a dataset subject to schema validation
L2GFeature is an abstract class with two attributes, study_loci_to_annotate and feature_dependency_type, and one key method, compute. study_loci_to_annotate is the gold standard or the study locus object on which the feature will be based; feature_dependency_type is the type(s) of the objects required to compute that evidence.
Addition of L2GFeatureInputLoader, a class in the feature factory that generates a dictionary with the dependencies needed to compute features. The key is the name of the dependency and the value is its content. It is generated by reading kwargs.
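The two classes described in this list could look roughly like the sketch below. It is a simplified illustration, not the PR's implementation: `select_by_type`, `ToyColocalisation`, and `ToyMaxH4Feature` are hypothetical names, and plain lists of dicts stand in for the Spark datasets.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Any


class L2GFeatureInputLoader:
    """Builds a {dependency name: dependency} dictionary from kwargs."""

    def __init__(self, **kwargs: Any) -> None:
        # Dependencies passed as None are skipped.
        self.input_dependencies = {k: v for k, v in kwargs.items() if v is not None}

    def select_by_type(self, dependency_type: Any) -> dict[str, Any]:
        # Hypothetical helper: keep only the dependencies a feature asks for.
        return {
            k: v for k, v in self.input_dependencies.items() if isinstance(v, dependency_type)
        }


class L2GFeature(ABC):
    """Abstract feature: two attributes and one key method."""

    feature_dependency_type: Any = None  # type(s) required to compute the feature

    def __init__(self, study_loci_to_annotate: list[dict]) -> None:
        self.study_loci_to_annotate = study_loci_to_annotate

    @abstractmethod
    def compute(self, feature_dependency: dict[str, Any]) -> dict:
        """Return {studyLocusId: value}."""


class ToyColocalisation(list):
    """Stand-in for a colocalisation dataset (assumption for this sketch)."""


class ToyMaxH4Feature(L2GFeature):
    feature_dependency_type = ToyColocalisation

    def compute(self, feature_dependency: dict[str, Any]) -> dict:
        (coloc,) = feature_dependency.values()  # exactly one dependency expected
        wanted = {row["studyLocusId"] for row in self.study_loci_to_annotate}
        best: dict = {}
        for row in coloc:
            key = row["leftStudyLocusId"]
            if key in wanted:
                best[key] = max(best.get(key, 0.0), row["h4"])
        return best


loader = L2GFeatureInputLoader(
    colocalisation=ToyColocalisation(
        [{"leftStudyLocusId": 1, "h4": 0.9}, {"leftStudyLocusId": 1, "h4": 0.4}]
    ),
    study_index=None,
)
feature = ToyMaxH4Feature([{"studyLocusId": 1}])
result = feature.compute(loader.select_by_type(feature.feature_dependency_type))
```

Here each concrete feature only declares what it needs and how to compute it; collecting the inputs stays in the loader.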
Colocalisation changes:
Rewrite of the colocalisation feature factory for the local features. Even though the coloc features are split by method and QTL type, I was able to abstract most of the core logic inside Colocalisation.extract_maximum_coloc_probability_per_region_and_gene. The code we have to write per feature is now minimal.
Addition of Colocalisation.append_right_study_metadata to bring the study type and the gene info from the right association

Other:

Addition of StudyLocus.build_feature_matrix to call within the step
Addition of L2GGoldStandard.build_feature_matrix to call within the step
The feature matrix is now written during the training step. It makes more sense to dump the dataset that produced the model.
Unit and integration tests (the semantic ones in coloc are commented out)

🙈 Missing

Should I split the feature_factory.py module per feature group?
Not all previous features are available:

The local ones extracted from colocalisation are ready.
The neighborhood ones extracted from colocalisation are not there (these are derived from the local)
The ones derived from distance are still to be included (and parsed from the variant index). I didn't want this PR to become even larger.

🚦 Before submitting

Do these changes cover one single feature (one change at a time)?
Did you read the contributor guideline?
Did you make sure to update the documentation with your changes?
Did you make sure there is no commented out code in this PR?
Did you follow conventional commits standards in PR title and commit messages?
Did you make sure the branch is up-to-date with the dev branch?
Did you write any new necessary tests?
Did you make sure the changes pass local tests (make test)?
Did you make sure the changes pass pre-commit rules (e.g poetry run pre-commit run --all-files)?

project-defiant

Overall the solution looks nice; from now on, adding new features should be straightforward.
@xyg123 please check the feature implementations, as I do not have much experience with the distance features.

config/step/ot_locus_to_gene_predict.yaml

project-defiant · 2024-09-19T07:55:21Z

src/gentropy/dataset/colocalisation.py

+                f"Colocalisation method {filter_by_colocalisation_method} is not supported."
+            )
+
+        method_colocalisation_metric = ColocalisationStep._get_colocalisation_class(


Using ColocalisationStep here might introduce a dependency cycle. If you need this function, it's better to move it into this class directly and then call it both here and in ColocalisationStep.

The method should also ensure that the coloc method defined in filter_by_colocalisation_method is a valid name.

This function is used in the ColocalisationStep directly. It is a mapper between the method name and the method class. How would that introduce a dependency issue?

The method should also ensure that the coloc method defined in filter_by_colocalisation_method is a valid name.

This is already happening.

Imports inside the function are a matter for discussion; you should usually place them at the top, as PEP 8 suggests. I normally do something like this only when I would otherwise introduce a dependency cycle (which should be avoided). But moving this method to the bottom-level module seems more appropriate in this case.

Such a method call introduces a risk if someone later tries to import the Colocalisation dataset back into the step directly.

For now it might not be a problem, but it's not a standard approach and can cause trouble in the future, that is all :)
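As an illustration of the suggestion in this thread, a name-to-class mapper that lives next to the dataset (rather than in the step) and validates the method name might look like the sketch below. The class names and the `METHOD_NAME`/`METHOD_METRIC` attributes are stand-ins assumed for the example, not a statement about the real modules.

```python
from __future__ import annotations


class Coloc:
    """Stand-in for a colocalisation method class (assumption)."""

    METHOD_NAME = "COLOC"
    METHOD_METRIC = "h4"


class ECaviar:
    """Stand-in for a second method class (assumption)."""

    METHOD_NAME = "eCAVIAR"
    METHOD_METRIC = "clpp"


def get_colocalisation_class(method: str) -> type:
    """Map a method name to its class, rejecting unsupported names.

    Keeping this in the lower-level module means both the dataset and the step
    can import it without importing each other.
    """
    registry = {cls.METHOD_NAME.lower(): cls for cls in (Coloc, ECaviar)}
    try:
        return registry[method.lower()]
    except KeyError:
        raise ValueError(f"Colocalisation method {method} is not supported.") from None


metric = get_colocalisation_class("coloc").METHOD_METRIC
```

With this layout all imports can sit at module top level, since the dependency only points from the step down to the dataset module.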

project-defiant · 2024-09-19T08:35:04Z

src/gentropy/dataset/colocalisation.py

+        """
+        # TODO: make this flexible to bring metadata from the left study (2 joins)
+        return self.df.join(
+            study_loci.df.selectExpr(


Just to be on the safe side, I would make the joins between studyLocus and studyIndex first and drop duplicates (if any) on the id pair before rejoining it to the coloc df. Curious about your opinion on that.

Like this? I'm fine with that, probably it is more readable because it doesn't start operating with the left/right nomenclature until later

return (
    # Annotate study loci with study metadata
    study_loci.df.select("studyLocusId", "studyId")
    .join(
        f.broadcast(study_index.df.select("studyId", *metadata_cols)), "studyId"
    )
    # Append that to the right side of the colocalisation dataset
    .selectExpr(
        "studyLocusId as rightStudyLocusId",
        *[f"{col} as right{col[0].upper() + col[1:]}" for col in metadata_cols],
    )
    .join(self.df, "rightStudyLocusId", "right")
)

Yes, but my key point was to add the drop_duplicates(["studyLocusId"])
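The reason for deduplicating before the join can be shown with a small plain-Python sketch. The helpers `drop_duplicates` and `join_metadata` are hypothetical stand-ins for the Spark operations, and the duplicated row is invented to trigger the problem.

```python
from __future__ import annotations


def drop_duplicates(rows: list[dict], key: str) -> list[dict]:
    """Keep the first row seen for each value of `key`."""
    seen: dict = {}
    for row in rows:
        seen.setdefault(row[key], row)
    return list(seen.values())


def join_metadata(coloc_rows: list[dict], study_locus_rows: list[dict]) -> list[dict]:
    """One output row per matching (colocalisation, study locus) pair."""
    return [
        {**coloc, "rightStudyId": locus["studyId"]}
        for coloc in coloc_rows
        for locus in study_locus_rows
        if locus["studyLocusId"] == coloc["rightStudyLocusId"]
    ]


coloc = [{"leftStudyLocusId": 1, "rightStudyLocusId": 2, "h4": 0.9}]
# Assumed input with an accidental duplicate of studyLocusId 2.
study_loci = [
    {"studyLocusId": 2, "studyId": "eqtl1"},
    {"studyLocusId": 2, "studyId": "eqtl1"},
]

inflated = join_metadata(coloc, study_loci)
deduplicated = join_metadata(coloc, drop_duplicates(study_loci, "studyLocusId"))
```

Without the deduplication, the single colocalisation row is emitted twice; with it, the row count of the colocalisation dataset is preserved.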

src/gentropy/dataset/l2g_feature.py

project-defiant · 2024-09-19T08:53:32Z

src/gentropy/dataset/l2g_feature.py

+        credible_set: StudyLocus | None = None,
+    ) -> None:
+        """Initializes a L2GFeature dataset.
+


Add to the docs that this class should contain the common methods used to build the instances of Feature datasets.

project-defiant · 2024-09-19T09:21:55Z

src/gentropy/dataset/l2g_feature.py

+        )
+
+
+class PQtlColocClppMaximumFeature(L2GFeature):


Now I can see this is very verbose. This is an area to improve.

project-defiant · 2024-09-19T11:31:44Z

src/gentropy/dataset/l2g_feature_matrix.py

@@ -145,7 +120,7 @@ def fill_na(
        Returns:
            L2GFeatureMatrix: L2G feature matrix dataset
        """
-        self.df = self._df.fillna(value, subset=subset)
+        self._df = self._df.fillna(value, subset=subset)


Create a new object instead of overwriting _df. I am planning to add immutability as a feature, so that Dataset objects cannot be modified in place.
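One way to read this suggestion is sketched below with a frozen dataclass, where `fill_na` returns a new object and leaves the original untouched. `ToyFeatureMatrix` is a hypothetical stand-in that holds a tuple of dicts instead of a Spark DataFrame.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyFeatureMatrix:
    """Immutable stand-in for a feature matrix (assumption for this sketch)."""

    rows: tuple

    def fill_na(self, value: float = 0.0) -> "ToyFeatureMatrix":
        # Build new rows rather than mutating the existing ones.
        filled = tuple(
            {k: (value if v is None else v) for k, v in row.items()} for row in self.rows
        )
        # `replace` returns a copy of the instance with the given fields changed.
        return replace(self, rows=filled)


original = ToyFeatureMatrix(rows=({"studyLocusId": 1, "exampleFeature": None},))
filled = original.fill_na(0.0)
```

Because the dataclass is frozen, assigning to `original.rows` raises an error, so callers have to use the returned object.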

project-defiant · 2024-09-19T12:23:48Z

tests/gentropy/dataset/test_l2g_feature.py

+            feature_class.feature_dependency_type
+        ),
+    )
+    assert isinstance(feature_dataset, L2GFeature)


It would be good to test if all of the requested features are actually in the dataframe columns.
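A check along these lines could be as small as the sketch below. The helper name and the example feature names are hypothetical, and `columns` stands for whatever the resulting dataframe reports as its column names.

```python
from __future__ import annotations


def missing_features(requested: list[str], columns: list[str]) -> list[str]:
    """Return the requested feature names that are absent from the columns."""
    present = set(columns)
    return [name for name in requested if name not in present]


requested_features = ["exampleFeatureA", "exampleFeatureB"]
observed_columns = ["studyLocusId", "geneId", "exampleFeatureA"]

missing = missing_features(requested_features, observed_columns)
```

In a test, asserting that `missing` is empty gives a failure message that names exactly which features were dropped.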

project-defiant · 2024-09-19T12:29:30Z

tests/gentropy/dataset/test_colocalisation.py

+    )
+    expected_cols = ["studyLocusId", "geneId", "h4"]
+    for col in expected_cols:
+        assert col in res_df.columns, f"Column {col} not found in result DataFrame."


Can you explicitly compare the expected values with the output column values?

project-defiant · 2024-09-19T12:31:09Z

src/gentropy/method/l2g/model.py

@@ -114,7 +114,7 @@ def predict(

        pd_dataframe.iteritems = pd_dataframe.items

-        feature_matrix_pdf = feature_matrix.df.toPandas()
+        feature_matrix_pdf = feature_matrix._df.toPandas()


_df should be a private attribute. Why change it?

xyg123 · 2024-09-19T13:29:56Z

Overall the solution looks nice; from now on, adding new features should be straightforward. @xyg123 please check the feature implementations, as I do not have much experience with the distance features.

All looks good. I didn't see the distance features in this PR, but the coloc features look consistent with what we had before.

ireneisdoomed · 2024-09-23T09:39:07Z

I had to make changes after adding semantic tests for extract_maximum_coloc_probability_per_region_and_gene and realising there was a problem when the input dataset to annotate was of type L2GGoldStandard.

The problem was that when I wanted to annotate a gold standard with the maximum colocalisation score from an eQTL for a gene, I was only calculating it for those study loci present in the gold standard. For example:

sample_gold_standard.df.show()
+------------+---------+-------+------+---------------+----------+
|studyLocusId|variantId|studyId|geneId|goldStandardSet|   sources|
+------------+---------+-------+------+---------------+----------+
|           1|     var1|  gwas1|    g1|       positive|[a_source]|
+------------+---------+-------+------+---------------+----------+

sample_colocalisation.df.show()
+----------------+-----------------+----------+--------------------+--------------------------+---+
|leftStudyLocusId|rightStudyLocusId|chromosome|colocalisationMethod|numberColocalisingVariants| h4|
+----------------+-----------------+----------+--------------------+--------------------------+---+
|               1|                2|         X|               COLOC|                         1|0.9|
+----------------+-----------------+----------+--------------------+--------------------------+---+

sample_study_index.df.show()
+-------+---------+------+---------+
|studyId|studyType|geneId|projectId|
+-------+---------+------+---------+
|  gwas1|     gwas|  null|       p1|
|  eqtl1|     eqtl|    g1|       p2|
+-------+---------+------+---------+

This function has a step where it adds the study type and the gene from the QTL study (the right side). Because the gold standard doesn't have all possible study loci, when I run the function the metadata is blank:

sample_colocalisation.append_right_study_metadata(sample_gold_standard, sample_study_index, ["studyType", "geneId"]).show()
+-----------------+------------+--------------+-----------+----------------+----------+--------------------+--------------------------+---+
|rightStudyLocusId|rightStudyId|rightStudyType|rightGeneId|leftStudyLocusId|chromosome|colocalisationMethod|numberColocalisingVariants| h4|
+-----------------+------------+--------------+-----------+----------------+----------+--------------------+--------------------------+---+
|                2|        null|          null|       null|               1|         X|               COLOC|                         1|0.9|
+-----------------+------------+--------------+-----------+----------------+----------+--------------------+--------------------------+---+

The solution was to pass study locus as a dependency of the colocalisation feature factories.
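The effect of the fix can be reproduced with the toy tables above in plain Python. This is only a sketch: `append_right_study_metadata_sketch` is a hypothetical stand-in for the Spark method, and the study locus row linking `studyLocusId` 2 to `eqtl1` is an assumption, since the example above does not print the full study locus dataset.

```python
from __future__ import annotations

gold_standard = [{"studyLocusId": 1, "studyId": "gwas1", "geneId": "g1"}]
# Assumption: the full study locus dataset also contains the QTL credible set.
study_locus = [
    {"studyLocusId": 1, "studyId": "gwas1"},
    {"studyLocusId": 2, "studyId": "eqtl1"},
]
colocalisation = [{"leftStudyLocusId": 1, "rightStudyLocusId": 2, "h4": 0.9}]
study_index = {
    "gwas1": {"studyType": "gwas", "geneId": None},
    "eqtl1": {"studyType": "eqtl", "geneId": "g1"},
}


def append_right_study_metadata_sketch(
    coloc_rows: list[dict], study_loci_rows: list[dict], index: dict
) -> list[dict]:
    """Keep every colocalisation row; fill right-side metadata when found."""
    locus_to_study = {row["studyLocusId"]: row["studyId"] for row in study_loci_rows}
    annotated = []
    for row in coloc_rows:
        study_id = locus_to_study.get(row["rightStudyLocusId"])
        metadata = index.get(study_id, {})
        annotated.append(
            {
                **row,
                "rightStudyId": study_id,
                "rightStudyType": metadata.get("studyType"),
                "rightGeneId": metadata.get("geneId"),
            }
        )
    return annotated


# Gold standard only: locus 2 is unknown, so the metadata stays empty.
with_gold_standard = append_right_study_metadata_sketch(
    colocalisation, gold_standard, study_index
)
# Full study locus dataset: the QTL study and its gene are found.
with_study_locus = append_right_study_metadata_sketch(
    colocalisation, study_locus, study_index
)
```

The first call reproduces the blank metadata shown above; the second shows why the features need the study locus dataset as a dependency.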

ireneisdoomed · 2024-09-23T09:41:23Z

@xyg123 @project-defiant Sorry it took me a while to merge. Thank you for your suggestions! There is no major change in the logic, so I'll merge unless you say otherwise.

refactor(L2GFeatureMatrix): remove schema validation

50d98ed

github-actions bot added size-M Method Refactor Dataset Step labels Sep 3, 2024

ireneisdoomed added 2 commits September 3, 2024 13:44

Merge branch 'dev' of https://github.com/opentargets/gentropy into il…

66a0f0b

…-3252

refactor(FeatureFactory): reshape feature generation WIP

e1f7c5c

github-actions bot added size-L and removed size-M labels Sep 3, 2024

pre-commit-ci bot and others added 11 commits September 3, 2024 17:48

chore: pre-commit auto fixes [...]

a7757ac

Merge branch 'dev' of https://github.com/opentargets/gentropy into il…

646a810

…-3252

chore: set l2gfeature properties with decorator

8a70bf2

chore(l2gfeature): make credible_set and input_dependency instance at…

c690ffc

…tributes

chore(l2gfeature): make credible_set and input_dependency instance at…

a54e694

…tributes

chore(featurefactory): distanceTssMeanFeature working

85a7bf4

refactor(l2g): improve step dependency management

d24de6d

feat: implement

6a3af69

chore: fix mypy issues

09d5291

Merge branch 'dev' of https://github.com/opentargets/gentropy into il…

6211d8d

…-3252

Merge branch 'il-3252' of https://github.com/opentargets/gentropy int…

5561b74

…o il-3252

github-actions bot added size-XL and removed size-L labels Sep 9, 2024

ireneisdoomed added 3 commits September 9, 2024 17:41

feat: l2gfeaturematrix.from_features_list working

b1f607b

Merge branch 'dev' of https://github.com/opentargets/gentropy into il…

021e159

…-3252

chore: comment out obsolete refs

da20073

github-actions bot added the documentation Improvements or additions to documentation label Sep 10, 2024

ireneisdoomed added 3 commits September 10, 2024 09:06

chore(L2GFeatureMatrix): change mode attribute to with_gold_standard

d06c059

refactor(l2g): move feature matrix writing to training module

0a007a7

feat(L2GFeatureMatrix): accept L2GGoldStandard or StudyLocus as inputs

abfdf22

addramir requested a review from xyg123 September 16, 2024 12:30

ireneisdoomed linked an issue Sep 18, 2024 that may be closed by this pull request

Optimise feature matrix management to accelerate L2G Training and Prediction opentargets/issues#3252

Closed

2 tasks

ireneisdoomed mentioned this pull request Sep 18, 2024

feat: drop v2g and reimplement distance features #771

Merged

9 tasks

ireneisdoomed added 4 commits September 18, 2024 17:06

Merge branch 'dev' of https://github.com/opentargets/gentropy into il…

3fa9b55

…-3252

chore: drop config yamls

95793c6

refactor: move feature classes to datasets module

cb5c169

docs: update feature docs

d3498b4

project-defiant approved these changes Sep 19, 2024

ireneisdoomed added 2 commits September 19, 2024 14:16

refactor(colocalisation): cleaner joins in append_right_study_metadata

ead7288

chore: better logging abstract methods

8c95bd4

project-defiant reviewed Sep 19, 2024

ireneisdoomed added 10 commits September 20, 2024 09:28

test: add L2GFeatureMatrix.test_from_features_list unit tests

8e2460e

fix: add goldStandardSet when a gs instance is passed to `from_featur…

f9d9fd4

…es_list`

fix: lowercase colocalisation type and add semantic test

5b21367

test: add semantic test for append_right_study_metadata

25e0c45

feat(colocalisation): make append_right_study_metadata extensible t…

a322b7c

…o left metadata

fix(colocalisation): append_study_metadata cant take a gold standard

7da2102

fix(colocalisation): extract_maximum_coloc_probability_per_region_and…

a25e66e

…_gene cant take a gold standard

feat: add StudyLocus as a dependency of colocalisation features

3d463d9

Merge branch 'dev' of https://github.com/opentargets/gentropy into il…

e889be6

…-3252

fix: add studylocus to input loader in test

80b62dd

ireneisdoomed added 3 commits September 23, 2024 10:48

fix: add studylocus to input loader in test

d863a33

fix: add studylocus to input loader in test

b17c538

Merge branch 'dev' of https://github.com/opentargets/gentropy into il…

0675972

…-3252

ireneisdoomed changed the title ~~refactor(L2GFeatureMatrix): streamline feature matrix management~~ refactor(L2GFeatureMatrix)!: streamline feature matrix management Sep 23, 2024

ireneisdoomed merged commit b93842a into dev Sep 23, 2024
5 checks passed

ireneisdoomed deleted the il-3252 branch September 23, 2024 12:30
