refactor(L2GFeatureMatrix)!: streamline feature matrix management #745

Merged
ireneisdoomed merged 48 commits into dev from il-3252
Sep 23, 2024

Conversation

ireneisdoomed
Contributor

@ireneisdoomed ireneisdoomed commented Sep 3, 2024

✨ Context

Rewrite of all the logic to generate features. Not all previous features are available (more below); I want to know whether this POC is good enough to keep working on.

L2GFeatureMatrix.from_features_list will be the main entry point. It reads a list of feature names from the config and instantiates the FeatureFactory. FeatureFactory.generate_features is responsible for iterating over this feature list.
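
As a rough illustration of that flow, here is a minimal sketch in plain Python. It is not the code in this PR: the `Feature` base class, the `ExampleH4Feature` class, the `feature_mapper` registry and the list-of-dicts rows are all hypothetical stand-ins. Only the idea of a factory that iterates over configured feature names comes from the description above.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class Feature(ABC):
    """Hypothetical stand-in for L2GFeature."""

    feature_name: str

    @classmethod
    @abstractmethod
    def compute(cls, rows: list[dict]) -> list[dict]:
        """Return one record per input row holding the feature value."""


class ExampleH4Feature(Feature):
    """Illustrative feature: copies the h4 value of each row."""

    feature_name = "exampleH4"

    @classmethod
    def compute(cls, rows: list[dict]) -> list[dict]:
        return [
            {"studyLocusId": row["studyLocusId"], cls.feature_name: row["h4"]}
            for row in rows
        ]


class FeatureFactory:
    """Instantiated with the feature names read from the config."""

    # Assumed registry: config name -> feature class.
    feature_mapper: dict[str, type[Feature]] = {"exampleH4": ExampleH4Feature}

    def __init__(self, features_list: list[str]) -> None:
        self.features_list = features_list

    def generate_features(self, rows: list[dict]) -> list[list[dict]]:
        """Iterate over the configured names and compute each feature."""
        computed = []
        for name in self.features_list:
            if name not in self.feature_mapper:
                raise ValueError(f"Feature {name} is not implemented.")
            computed.append(self.feature_mapper[name].compute(rows))
        return computed
```

With a registry like this, adding a feature means writing one small class and adding one entry to the mapping; the iteration logic stays untouched.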

A schema of the new design is here:
https://excalidraw.com/#json=FSOwgU37GReVJuRGCjE6f,LAfIi3o_VJ635XCHNBmG7g

🛠 What does this PR implement

Feature Factory changes:

  • L2GFeatureMatrix is no longer a dataset subject to schema validation
  • L2GFeature is an abstract class with two attributes, study_loci_to_annotate and feature_dependency_type, and one key method, compute. study_loci_to_annotate is the gold standard or the study locus object on which the feature will be based, and feature_dependency_type is the type(s) of the objects required to compute that evidence
  • Addition of L2GFeatureInputLoader, a class in the feature factory that generates a dictionary with the dependencies needed to compute features. The key is the name of the dependency and the value is its content. It is generated by reading kwargs.

Colocalisation changes:

  • Rewrite of the colocalisation feature factory for the local features. Even though the coloc features are split by method and QTL type, I was able to abstract most of the core business inside Colocalisation.extract_maximum_coloc_probability_per_region_and_gene. The code we have to write per feature is now minimal.
  • Addition of Colocalisation.append_right_study_metadata to bring the study type and the gene info from the right association
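
The input loader idea (a dictionary of named dependencies built from kwargs, from which each feature picks the objects of the type it needs) might look something like the sketch below. The class name `FeatureInputLoader`, the method `get_dependency_by_type` and the empty stand-in dataset classes are assumptions made for illustration, not necessarily what the PR implements.

```python
from __future__ import annotations

from typing import Any


class Colocalisation:
    """Empty stand-in for the real dataset class."""


class StudyIndex:
    """Empty stand-in for the real dataset class."""


class FeatureInputLoader:
    """Hypothetical sketch of a kwargs-driven dependency loader."""

    def __init__(self, **kwargs: Any) -> None:
        # Key: name of the dependency; value: its content.
        # Dependencies passed as None are simply left out.
        self.input_dependencies = {
            name: value for name, value in kwargs.items() if value is not None
        }

    def get_dependency_by_type(
        self, dependency_type: type | tuple[type, ...]
    ) -> dict[str, Any]:
        """Return only the dependencies matching the requested type(s)."""
        return {
            name: value
            for name, value in self.input_dependencies.items()
            if isinstance(value, dependency_type)
        }
```

A feature class could then declare its feature_dependency_type once and receive only the matching objects, instead of every feature accepting every dataset.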

Other:

  • Addition of StudyLocus.build_feature_matrix to call within the step
  • Addition of L2GGoldStandard.build_feature_matrix to call within the step
  • The feature matrix is now written during the training step. It makes more sense to dump the dataset that produced the model.
  • Unit and integration tests (the semantic ones in coloc are commented out)

🙈 Missing

  1. Should I split the feature_factory.py module per feature group?
  2. Not all previous features are available:
     • The local ones extracted from colocalisation are ready.
     • The neighborhood ones extracted from colocalisation are not there yet (these are derived from the local ones).
     • The ones derived from distance are still to be included (and parsed from the variant index). I didn't want this PR to become even larger.

🚦 Before submitting

  • Do these changes cover one single feature (one change at a time)?
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes?
  • Did you make sure there is no commented out code in this PR?
  • Did you follow conventional commits standards in PR title and commit messages?
  • Did you make sure the branch is up-to-date with the dev branch?
  • Did you write any new necessary tests?
  • Did you make sure the changes pass local tests (make test)?
  • Did you make sure the changes pass pre-commit rules (e.g. poetry run pre-commit run --all-files)?

@github-actions github-actions bot added size-L and removed size-M labels Sep 3, 2024
@github-actions github-actions bot added size-XL and removed size-L labels Sep 9, 2024
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Sep 10, 2024
Contributor

@project-defiant project-defiant left a comment

Overall the solution looks nice; from now on, adding new features should be straightforward.
@xyg123 please check the feature implementations, as I do not have so much experience with the distance features.

config/step/ot_locus_to_gene_predict.yaml (outdated, resolved)
f"Colocalisation method {filter_by_colocalisation_method} is not supported."
)

method_colocalisation_metric = ColocalisationStep._get_colocalisation_class(
Contributor

Using the ColocalisationStep here might introduce a dependency cycle. If you need this function, it's better to move it into this class and then call it both here and in the ColocalisationStep.

The method should also ensure that the coloc method defined in filter_by_colocalisation_method is a valid name.

Contributor Author

@ireneisdoomed ireneisdoomed Sep 19, 2024

This function is used in the ColocalisationStep directly. It is a mapper between the method name and the method class. How would that introduce a dependency issue?

> The method should also ensure that the coloc method defined in filter_by_colocalisation_method is a valid name.

This is already happening.

Contributor

  1. The imports inside the function are a matter for discussion; you should usually place them at the top of the module, as PEP 8 suggests. I normally do something like this only when I would otherwise introduce a dependency cycle (which should be avoided). Moving this method to the bottom-level module seems more appropriate in this case.
  2. Having such a method call introduces a risk if someone later imports the Colocalisation dataset back into the step directly.

For now it might not be a problem, but it's not a standard approach and can cause trouble in the future, that is all :)
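
One way to picture the suggestion: the name-to-class mapper lives in the lowest-level module, so both the dataset and the step import it and neither imports the other. The sketch below is hypothetical (the function name `get_colocalisation_class`, the `_METHOD_TO_CLASS` mapping and the empty method classes are assumptions), and it also covers the point about rejecting invalid method names.

```python
class Coloc:
    """Empty stand-in for a colocalisation method class."""


class ECaviar:
    """Empty stand-in for a colocalisation method class."""


# Assumed mapping, defined once in a bottom-level module so that the dataset
# and the step can both import it without importing each other.
_METHOD_TO_CLASS = {"coloc": Coloc, "ecaviar": ECaviar}


def get_colocalisation_class(method: str) -> type:
    """Map a method name to its class, rejecting unknown names."""
    key = method.lower()
    if key not in _METHOD_TO_CLASS:
        supported = ", ".join(sorted(_METHOD_TO_CLASS))
        raise ValueError(
            f"Colocalisation method {method} is not supported. "
            f"Expected one of: {supported}."
        )
    return _METHOD_TO_CLASS[key]
```

With this layout the import can sit at the top of both modules, so no function-level import is needed to dodge a cycle.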

"""
# TODO: make this flexible to bring metadata from the left study (2 joins)
return self.df.join(
study_loci.df.selectExpr(
Contributor

Just to be on the safe side, I would make the joins between studyLocus and studyIndex first and drop duplicates (if any) on the id pair before rejoining it to the coloc df. Curious about your opinion on that.

Contributor Author

Like this? I'm fine with that; it is probably more readable because it doesn't start using the left/right nomenclature until later.

        return (
            # Annotate study loci with study metadata
            study_loci.df.select("studyLocusId", "studyId")
            .join(
                f.broadcast(study_index.df.select("studyId", *metadata_cols)), "studyId"
            )
            # Append that to the right side of the colocalisation dataset
            .selectExpr(
                "studyLocusId as rightStudyLocusId",
                *[f"{col} as right{col[0].upper() + col[1:]}" for col in metadata_cols],
            )
            .join(self.df, "rightStudyLocusId", "right")
        )

Contributor

Yes, but my key point was to add the drop_duplicates(["studyLocusId"])
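
To see why the deduplication matters, here is a small plain-Python sketch (lists of dicts instead of Spark DataFrames; the helper names and the sample rows are made up for illustration). A duplicated studyLocusId in the lookup side multiplies the colocalisation rows after the join, and dropping duplicates on the key first keeps one row per colocalisation.

```python
def drop_duplicates(rows: list[dict], key: str) -> list[dict]:
    """Keep the first row seen for each value of `key`."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


def inner_join(left: list[dict], right: list[dict], key: str) -> list[dict]:
    """Naive inner join of two lists of dicts on a shared key."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


# Hypothetical lookup with an accidental duplicate for study locus 2.
lookup = [
    {"rightStudyLocusId": 2, "rightStudyType": "eqtl"},
    {"rightStudyLocusId": 2, "rightStudyType": "eqtl"},
]
coloc = [{"leftStudyLocusId": 1, "rightStudyLocusId": 2, "h4": 0.9}]

inflated = inner_join(coloc, lookup, "rightStudyLocusId")
clean = inner_join(coloc, drop_duplicates(lookup, "rightStudyLocusId"), "rightStudyLocusId")
```

Here `inflated` holds two rows for a single colocalisation result while `clean` holds one. Note that this helper keeps the first duplicate, whereas Spark does not guarantee which row survives.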

src/gentropy/dataset/l2g_feature.py (outdated, resolved)
credible_set: StudyLocus | None = None,
) -> None:
"""Initializes a L2GFeature dataset.

Contributor

Add to the docs that this class should contain the common methods used to build instances of Feature datasets.

)


class PQtlColocClppMaximumFeature(L2GFeature):
Contributor

Now I can see this is very verbose. This will be an area to improve.

@@ -145,7 +120,7 @@ def fill_na(
Returns:
L2GFeatureMatrix: L2G feature matrix dataset
"""
- self.df = self._df.fillna(value, subset=subset)
+ self._df = self._df.fillna(value, subset=subset)
Contributor

Create a new object instead of overwriting _df. I am planning to make this a feature, so that Dataset objects are immutable.
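
A minimal sketch of that pattern, assuming a frozen dataclass and a tuple of dicts in place of a Spark DataFrame (the `FeatureMatrix` class here is hypothetical and much simpler than the real one): `fill_na` returns a new object and leaves the original untouched.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FeatureMatrix:
    """Hypothetical immutable container; rows stand in for a DataFrame."""

    rows: tuple[dict, ...]

    def fill_na(self, value: float = 0.0) -> FeatureMatrix:
        """Return a new FeatureMatrix with missing values replaced."""
        filled = tuple(
            {key: (value if val is None else val) for key, val in row.items()}
            for row in self.rows
        )
        # replace() builds a new instance; self is left as it was.
        return replace(self, rows=filled)
```

Freezing the dataclass only blocks attribute reassignment. The dicts inside are still mutable in this toy version, which is not a concern with Spark DataFrames because their transformations already return new objects.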

feature_class.feature_dependency_type
),
)
assert isinstance(feature_dataset, L2GFeature)
Contributor

It would be good to test if all of the requested features are actually in the dataframe columns.
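
Such a check could be as small as the helper below. This is a sketch only: the function name and the feature names in the usage example are illustrative, and in a real test the columns would come from the resulting DataFrame.

```python
def missing_features(requested: list[str], columns: list[str]) -> list[str]:
    """Return the requested feature names absent from the columns."""
    return sorted(set(requested) - set(columns))


# Illustrative names, not necessarily the ones used in the PR.
requested = ["eQtlColocH4Maximum", "pQtlColocClppMaximum"]
columns = ["studyLocusId", "geneId", "eQtlColocH4Maximum"]
missing = missing_features(requested, columns)
```

A test would then assert that `missing` is empty, and the failure message can list exactly which features were not generated.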

)
expected_cols = ["studyLocusId", "geneId", "h4"]
for col in expected_cols:
assert col in res_df.columns, f"Column {col} not found in result DataFrame."
Contributor

Can you explicitly compare the expected values with the output column values?
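
One hedged way to do that, sketched with plain dicts standing in for collected rows (the helper name and the sample values are assumptions): project both sides onto the expected columns and compare them as sorted tuples, so that row order does not matter.

```python
def as_sorted_records(rows: list[dict], cols: list[str]) -> list[tuple]:
    """Project rows onto `cols` and sort them for an order-free comparison."""
    return sorted(tuple(row[col] for col in cols) for row in rows)


expected_cols = ["studyLocusId", "geneId", "h4"]

# Hypothetical collected output and expected values.
observed = [
    {"studyLocusId": 2, "geneId": "g2", "h4": 0.5, "extra": "ignored"},
    {"studyLocusId": 1, "geneId": "g1", "h4": 0.9, "extra": "ignored"},
]
expected = [
    {"studyLocusId": 1, "geneId": "g1", "h4": 0.9},
    {"studyLocusId": 2, "geneId": "g2", "h4": 0.5},
]

matches = as_sorted_records(observed, expected_cols) == as_sorted_records(expected, expected_cols)
```

For computed floating-point values an approximate comparison would be safer than strict equality.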

@@ -114,7 +114,7 @@ def predict(

pd_dataframe.iteritems = pd_dataframe.items

- feature_matrix_pdf = feature_matrix.df.toPandas()
+ feature_matrix_pdf = feature_matrix._df.toPandas()
Contributor

_df should be a private attribute. Why change it?
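
The usual Python idiom behind this remark is to keep `_df` private and expose it through a read-only property. The class below is a hypothetical, stripped-down illustration and not the project's actual Dataset class.

```python
class Matrix:
    """Hypothetical container exposing its data through a read-only property."""

    def __init__(self, df: list[dict]) -> None:
        self._df = df  # private by convention; callers should use .df

    @property
    def df(self) -> list[dict]:
        """Public, read-only access to the underlying data."""
        return self._df
```

Callers then write `matrix.df` everywhere, and because no setter is defined, an accidental `matrix.df = ...` raises an AttributeError.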

@xyg123
Contributor

xyg123 commented Sep 19, 2024

> Overall the solution looks nice; from now on, adding new features should be straightforward. @xyg123 please check the feature implementations, as I do not have so much experience with the distance features.

All looks good. I didn't see the distance features in this PR, but the coloc features look consistent with what we had before.

@ireneisdoomed
Contributor Author

I had to make changes after adding semantic tests for extract_maximum_coloc_probability_per_region_and_gene and realising there was a problem when the input dataset to annotate was of type L2GGoldStandard.

The problem was that when I wanted to annotate a gold standard with the maximum colocalisation score from an eQTL for a gene, I was only calculating it for those study loci present in the gold standard. For example:

sample_gold_standard.df.show()
+------------+---------+-------+------+---------------+----------+
|studyLocusId|variantId|studyId|geneId|goldStandardSet|   sources|
+------------+---------+-------+------+---------------+----------+
|           1|     var1|  gwas1|    g1|       positive|[a_source]|
+------------+---------+-------+------+---------------+----------+

sample_colocalisation.df.show()
+----------------+-----------------+----------+--------------------+--------------------------+---+
|leftStudyLocusId|rightStudyLocusId|chromosome|colocalisationMethod|numberColocalisingVariants| h4|
+----------------+-----------------+----------+--------------------+--------------------------+---+
|               1|                2|         X|               COLOC|                         1|0.9|
+----------------+-----------------+----------+--------------------+--------------------------+---+

sample_study_index.df.show()
+-------+---------+------+---------+
|studyId|studyType|geneId|projectId|
+-------+---------+------+---------+
|  gwas1|     gwas|  null|       p1|
|  eqtl1|     eqtl|    g1|       p2|
+-------+---------+------+---------+

This function has a step where it adds the study type and the gene from the QTL study (the right side). Because the gold standard doesn't have all possible study loci, when I run the function the metadata is blank:

sample_colocalisation.append_right_study_metadata(sample_gold_standard, sample_study_index, ["studyType", "geneId"]).show()
+-----------------+------------+--------------+-----------+----------------+----------+--------------------+--------------------------+---+
|rightStudyLocusId|rightStudyId|rightStudyType|rightGeneId|leftStudyLocusId|chromosome|colocalisationMethod|numberColocalisingVariants| h4|
+-----------------+------------+--------------+-----------+----------------+----------+--------------------+--------------------------+---+
|                2|        null|          null|       null|               1|         X|               COLOC|                         1|0.9|
+-----------------+------------+--------------+-----------+----------------+----------+--------------------+--------------------------+---+

The solution was to pass study locus as a dependency of the colocalisation feature factories.
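
The effect can be reproduced with a small plain-Python sketch of the metadata join. The function below only mimics the behaviour shown in the tables above and is not the real method. The `all_study_loci` rows are an assumption: the comment does not show the study locus table, so it is assumed here that study locus 2 belongs to study eqtl1.

```python
def append_right_study_metadata(coloc, study_loci, study_index, metadata_cols):
    """Annotate the right side of each coloc row via study locus -> study."""
    locus_to_study = {sl["studyLocusId"]: sl["studyId"] for sl in study_loci}
    study_meta = {study["studyId"]: study for study in study_index}
    annotated = []
    for row in coloc:
        study_id = locus_to_study.get(row["rightStudyLocusId"])
        meta = study_meta.get(study_id, {})
        new_row = dict(row, rightStudyId=study_id)
        for col in metadata_cols:
            # e.g. studyType -> rightStudyType
            new_row["right" + col[0].upper() + col[1:]] = meta.get(col)
        annotated.append(new_row)
    return annotated


coloc = [{"leftStudyLocusId": 1, "rightStudyLocusId": 2, "h4": 0.9}]
study_index = [
    {"studyId": "gwas1", "studyType": "gwas", "geneId": None},
    {"studyId": "eqtl1", "studyType": "eqtl", "geneId": "g1"},
]
gold_standard = [{"studyLocusId": 1, "studyId": "gwas1"}]
# Assumed full study locus table, including the QTL locus.
all_study_loci = gold_standard + [{"studyLocusId": 2, "studyId": "eqtl1"}]

cols = ["studyType", "geneId"]
from_gold = append_right_study_metadata(coloc, gold_standard, study_index, cols)
from_full = append_right_study_metadata(coloc, all_study_loci, study_index, cols)
```

With only the gold standard as the lookup, the right-hand metadata in `from_gold` is empty, matching the table above. With the full study locus table, `from_full` carries the eQTL study type and gene, which is what passing study locus as a dependency achieves.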

@ireneisdoomed
Contributor Author

@xyg123 @project-defiant Sorry it took me a while to merge. Thank you for your suggestions! There is no major change in the logic, so I'll merge unless you say otherwise.

@ireneisdoomed ireneisdoomed changed the title from "refactor(L2GFeatureMatrix): streamline feature matrix management" to "refactor(L2GFeatureMatrix)!: streamline feature matrix management" Sep 23, 2024
@ireneisdoomed ireneisdoomed merged commit b93842a into dev Sep 23, 2024
5 checks passed
@ireneisdoomed ireneisdoomed deleted the il-3252 branch September 23, 2024 12:30
Development

Successfully merging this pull request may close these issues.

Optimise feature matrix management to accelerate L2G Training and Prediction
4 participants