refactor(L2GFeatureMatrix)!: streamline feature matrix management #745
Conversation
Overall the solution looks nice; from now on, adding new features should be straightforward.
@xyg123 please check the feature implementations, as I do not have much experience with the distance features.
f"Colocalisation method {filter_by_colocalisation_method} is not supported."
)

method_colocalisation_metric = ColocalisationStep._get_colocalisation_class(
Using the ColocalisationStep might introduce a dependency cycle. If you need this function, it's better to move it into this class directly and then call it here as well as in the ColocalisationStep.
The method should also ensure that the coloc method defined in filter_by_colocalisation_method is a valid name.
This function is used in the ColocalisationStep directly. It is a mapper between the method name and the method class. How would that introduce a dependency issue?
The method should also ensure that the coloc method defined in filter_by_colocalisation_method is a valid name.
This is happening
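A minimal sketch of the name-to-class mapper with name validation being discussed (class names and the registry are hypothetical, not the actual gentropy implementation):

```python
# Hypothetical sketch: map a colocalisation method name to its class and
# reject names that are not registered. Coloc and ECaviar are stand-ins.
class Coloc:
    METHOD_NAME = "COLOC"

class ECaviar:
    METHOD_NAME = "eCAVIAR"

METHOD_REGISTRY = {cls.METHOD_NAME: cls for cls in (Coloc, ECaviar)}

def get_colocalisation_class(method_name: str) -> type:
    """Return the class for a method name, raising on unknown names."""
    if method_name not in METHOD_REGISTRY:
        raise ValueError(f"Colocalisation method {method_name} is not supported.")
    return METHOD_REGISTRY[method_name]
```

Keeping the registry and lookup in a bottom-level module means both the step and the dataset can import it without a cycle.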
- The imports inside the function are a matter to discuss; you should usually place them at the top, as PEP 8 suggests. I normally do something like this only when I would otherwise introduce a dependency cycle (which should be avoided). But moving this method to the bottom-level module seems more appropriate in this case.
- Having such a method call introduces a risk when someone tries to import the Colocalisation dataset back into the step directly.
For now it might not be a problem, but it's not a standard approach and can cause trouble in the future, that is all :)
"""
# TODO: make this flexible to bring metadata from the left study (2 joins)
return self.df.join(
    study_loci.df.selectExpr(
Just to be on the safe side, I would make the joins between studyLocus and studyIndex first and drop duplicates (if any) on the id pair before rejoining it to the coloc df. Curious about your opinion on that.
Like this? I'm fine with that; it's probably more readable because it doesn't start operating with the left/right nomenclature until later:
return (
# Annotate study loci with study metadata
study_loci.df.select("studyLocusId", "studyId")
.join(
f.broadcast(study_index.df.select("studyId", *metadata_cols)), "studyId"
)
# Append that to the right side of the colocalisation dataset
.selectExpr(
"studyLocusId as rightStudyLocusId",
*[f"{col} as right{col[0].upper() + col[1:]}" for col in metadata_cols],
)
.join(self.df, "rightStudyLocusId", "right")
)
Yes, but my key point was to add the drop_duplicates(["studyLocusId"])
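To illustrate why deduplicating before the rejoin matters, here is a toy fan-out example in plain Python (not the actual Spark code; the row values are made up):

```python
# Toy illustration of join fan-out: if the studyLocus/studyIndex join yields
# duplicate rows for a studyLocusId, rejoining them to the coloc rows
# multiplies the coloc rows.
study_locus_metadata = [
    {"studyLocusId": 2, "studyType": "eqtl"},
    {"studyLocusId": 2, "studyType": "eqtl"},  # accidental duplicate
]
coloc_rows = [{"rightStudyLocusId": 2, "h4": 0.9}]

joined = [
    {**c, **m}
    for c in coloc_rows
    for m in study_locus_metadata
    if c["rightStudyLocusId"] == m["studyLocusId"]
]
assert len(joined) == 2  # one coloc row became two

# Dropping duplicates on the id first restores a one-to-one join
# (the Spark equivalent would be dropDuplicates on the id column)
deduped = list({m["studyLocusId"]: m for m in study_locus_metadata}.values())
joined = [
    {**c, **m}
    for c in coloc_rows
    for m in deduped
    if c["rightStudyLocusId"] == m["studyLocusId"]
]
assert len(joined) == 1
```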
credible_set: StudyLocus | None = None,
) -> None:
    """Initializes a L2GFeature dataset.
Add to the docs that this class should contain the common methods used to build the instances of Feature datasets.
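A minimal sketch of what the documented base class could look like (the attribute and method names follow the PR description; the implementation is illustrative):

```python
from abc import ABC, abstractmethod
from typing import Any

class L2GFeature(ABC):
    """Base class for all L2G features.

    This class should contain the common methods used to build the
    instances of Feature datasets.
    """

    def __init__(self, study_loci_to_annotate: Any, feature_dependency_type: Any) -> None:
        # Shared state every concrete feature needs
        self.study_loci_to_annotate = study_loci_to_annotate
        self.feature_dependency_type = feature_dependency_type

    @abstractmethod
    def compute(self, feature_dependency: Any) -> Any:
        """Each concrete feature implements its own computation."""
```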
)


class PQtlColocClppMaximumFeature(L2GFeature):
Now I can see this is very verbose. This will be a field to improve.
@@ -145,7 +120,7 @@ def fill_na(
        Returns:
            L2GFeatureMatrix: L2G feature matrix dataset
        """
-       self.df = self._df.fillna(value, subset=subset)
+       self._df = self._df.fillna(value, subset=subset)
Create a new object instead of overwriting _df. I am planning to make this a feature, so that the Dataset objects are immutable.
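A minimal sketch of the immutable variant, assuming a toy stand-in for the real Spark-backed Dataset (the tuple-based `_df` is illustrative only):

```python
from dataclasses import dataclass

# Sketch of the suggestion: fill_na returns a fresh object instead of
# mutating self._df, keeping instances effectively immutable.
@dataclass(frozen=True)
class FeatureMatrix:
    _df: tuple  # rows as (column_name, value_or_None) pairs; stand-in for a DataFrame

    def fill_na(self, value: float) -> "FeatureMatrix":
        filled = tuple((k, value if v is None else v) for k, v in self._df)
        return FeatureMatrix(_df=filled)  # new instance; self is untouched

fm = FeatureMatrix(_df=(("h4", None),))
fm2 = fm.fill_na(0.0)
assert fm._df == (("h4", None),)  # original unchanged
assert fm2._df == (("h4", 0.0),)
```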
        feature_class.feature_dependency_type
    ),
)
assert isinstance(feature_dataset, L2GFeature)
It would be good to test if all of the requested features are actually in the dataframe columns.
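A sketch of such a check, with illustrative names (not the repo's actual helper):

```python
# Hypothetical check: verify every requested feature materialised as a column
# in the generated matrix before trusting the result.
def check_features_present(requested: list[str], columns: list[str]) -> None:
    missing = set(requested) - set(columns)
    if missing:
        raise ValueError(f"Requested features missing from matrix: {sorted(missing)}")

check_features_present(
    ["eQtlColocH4Maximum"],
    ["studyLocusId", "geneId", "eQtlColocH4Maximum"],
)  # passes silently when all features are present
```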
)
expected_cols = ["studyLocusId", "geneId", "h4"]
for col in expected_cols:
    assert col in res_df.columns, f"Column {col} not found in result DataFrame."
Can you explicitly compare the expected values with the output column values?
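A sketch of the stricter assertion, comparing rows rather than only column names (the values here are illustrative, not the actual test fixtures):

```python
# Compare actual output rows against expected rows, not just column presence.
res_rows = [{"studyLocusId": 1, "geneId": "g1", "h4": 0.9}]
expected_rows = [{"studyLocusId": 1, "geneId": "g1", "h4": 0.9}]

key = lambda r: r["studyLocusId"]
assert sorted(res_rows, key=key) == sorted(
    expected_rows, key=key
), "Result rows do not match expected values."
```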
@@ -114,7 +114,7 @@ def predict(
    pd_dataframe.iteritems = pd_dataframe.items

-   feature_matrix_pdf = feature_matrix.df.toPandas()
+   feature_matrix_pdf = feature_matrix._df.toPandas()
The _df should be a private attribute. Why change it?
All looks good. I didn't see the distance features on this PR, but the coloc features look consistent with what we had before.
…_gene can't take a gold standard
I had to make changes after adding semantic tests for extract_maximum_coloc_probability_per_region_and_gene and realising there was a problem when the input dataset to annotate was a gold standard. When I wanted to annotate a gold standard with the maximum colocalisation score from an eQTL for a gene, I was only calculating it for those study loci present in the gold standard. For example:
sample_gold_standard.df.show()
+------------+---------+-------+------+---------------+----------+
|studyLocusId|variantId|studyId|geneId|goldStandardSet| sources|
+------------+---------+-------+------+---------------+----------+
| 1| var1| gwas1| g1| positive|[a_source]|
+------------+---------+-------+------+---------------+----------+
sample_colocalisation.df.show()
+----------------+-----------------+----------+--------------------+--------------------------+---+
|leftStudyLocusId|rightStudyLocusId|chromosome|colocalisationMethod|numberColocalisingVariants| h4|
+----------------+-----------------+----------+--------------------+--------------------------+---+
| 1| 2| X| COLOC| 1|0.9|
+----------------+-----------------+----------+--------------------+--------------------------+---+
sample_study_index.df.show()
+-------+---------+------+---------+
|studyId|studyType|geneId|projectId|
+-------+---------+------+---------+
| gwas1| gwas| null| p1|
| eqtl1| eqtl| g1| p2|
This function has a step that appends the study type and the gene from the QTL study (the right side). Because the gold standard doesn't contain all possible study loci, when I ran the function that metadata came back blank.
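A plain-Python illustration of the bug (not the actual Spark code): when the metadata lookup is built only from the gold-standard loci, the right-side study locus has no match and its metadata ends up null.

```python
# The gold standard only contains studyLocusId 1, but the coloc row points to
# studyLocusId 2 on the right side, so a lookup built from the gold standard
# alone has nothing to attach.
gold_standard_loci = {1}
coloc = {"leftStudyLocusId": 1, "rightStudyLocusId": 2, "h4": 0.9}
study_metadata = {locus: {"studyType": "gwas"} for locus in gold_standard_loci}

right_meta = study_metadata.get(coloc["rightStudyLocusId"])  # locus 2 never annotated
assert right_meta is None  # hence the blank studyType/geneId columns
```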
The solution was to pass the study locus as a dependency of the colocalisation feature factories.
@xyg123 @project-defiant Sorry it took me a while to merge. Thank you for your suggestions! There is no major change in the logic, so I'll merge unless you say otherwise.
✨ Context
Rewrite of all the logic to generate features. Not all previous features are available (more below); I want to know whether this POC is good enough to keep working on.
- L2GFeatureMatrix.from_features_list will be the main entry point. It reads a list of feature names from the config and instantiates the FeatureFactory.
- FeatureFactory.generate_features is responsible for iterating over this feature list.

A schema of the new design is here:
https://excalidraw.com/#json=FSOwgU37GReVJuRGCjE6f,LAfIi3o_VJ635XCHNBmG7g
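The entry-point flow above can be sketched as follows (the class and method names mirror the PR description, but the bodies and the registry are illustrative stand-ins for the real implementation):

```python
# Illustrative sketch of the design: from_features_list reads feature names
# and delegates to the factory, which iterates the list and computes each
# feature from its dependencies. The registry entry below is hypothetical.
FEATURE_REGISTRY = {
    "eQtlColocH4Maximum": lambda deps: {"eQtlColocH4Maximum": deps["colocalisation"]},
}

class FeatureFactory:
    def __init__(self, features_list: list[str]) -> None:
        self.features_list = features_list

    def generate_features(self, dependencies: dict) -> list[dict]:
        # Iterate over the configured feature names and compute each one
        return [FEATURE_REGISTRY[name](dependencies) for name in self.features_list]

class L2GFeatureMatrix:
    @classmethod
    def from_features_list(cls, features_list: list[str], dependencies: dict) -> list[dict]:
        return FeatureFactory(features_list).generate_features(dependencies)

features = L2GFeatureMatrix.from_features_list(
    ["eQtlColocH4Maximum"], {"colocalisation": 0.9}
)
```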
🛠 What does this PR implement
Feature Factory changes:
- L2GFeatureMatrix is no longer a dataset subject to schema validation
- L2GFeature is an abstract class with 2 attributes, study_loci_to_annotate and feature_dependency_type, and one key method, compute. study_loci_to_annotate is the gold standard or the study locus object on which the feature will be based, and feature_dependency_type is the type(s) of the objects required to compute that evidence
- L2GFeatureInputLoader, a class in the feature factory that generates a dictionary with the dependencies to compute features. The key is the name of the dependency, the value is the content. It is generated by reading kwargs.
Colocalisation changes:
- Colocalisation.extract_maximum_coloc_probability_per_region_and_gene. The code we have to write per feature is now minimal.
- Colocalisation.append_right_study_metadata to bring the study type and the gene info from the right association
Other:
- StudyLocus.build_feature_matrix to call within the step
- L2GGoldStandard.build_feature_matrix to call within the step
🙈 Missing
- feature_factory.py module per feature group?
🚦 Before submitting
- PR made against the dev branch?
- Tests run (make test)?
- Pre-commit hooks run (poetry run pre-commit run --all-files)?