feat: add biosample index #769
base: dev
Conversation
… in biosample index
A few notes:
"name": "dbXrefs", | ||
"type": { | ||
"type": "array", | ||
"elementType": "string", | ||
"containsNull": true | ||
}, | ||
"nullable": true, | ||
"metadata": {} | ||
}, |
This is not really good: we already have this column in other datasets, e.g. VariantIndex and others in the platform. The expectation for this field is that it is a list of structs, where each element informs not only about the identifier but also about the source:
{
"metadata": {},
"name": "dbXrefs",
"nullable": true,
"type": {
"containsNull": true,
"elementType": {
"fields": [
{
"metadata": {},
"name": "id",
"nullable": true,
"type": "string"
},
{
"metadata": {},
"name": "source",
"nullable": true,
"type": "string"
}
],
"type": "struct"
},
"type": "array"
}
}
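As an illustration of the shape the reviewer expects, bare identifier strings can be wrapped into `{id, source}` entries; the helper name and the source label here are hypothetical, not the platform API:

```python
def to_db_xrefs(ids, source):
    """Wrap bare identifier strings into {id, source} entries,
    matching the array-of-structs shape used by e.g. VariantIndex."""
    return [{"id": i, "source": source} for i in ids]


# e.g. ontology identifiers that all come from one source
entries = to_db_xrefs(["UBERON:0002107", "CL:0000236"], "uberon")
```

In PySpark the same reshaping would be done with column expressions rather than a Python loop, but the target schema is the struct-of-arrays one shown in the comment.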
I agree, sorta... See my comment on the pull request. It wasn't immediately apparent how to structure it
@@ -26,7 +26,7 @@ This section contains information about the data source harmonisation tools avai
2. GWAS catalog's [harmonisation pipeline](https://www.ebi.ac.uk/gwas/docs/methods/summary-statistics#_harmonised_summary_statistics_data)
3. Ensembl's [Variant Effect Predictor](https://www.ensembl.org/info/docs/tools/vep/index.html)

## Linkage desiquilibrium
Great catch!
src/gentropy/config.py
Outdated
target_path: str = MISSING
biosample_index_path: str = MISSING
_target_: str = "gentropy.biosample_index.BiosampleIndexStep"
The configuration should have all the inputs that the step needs:

- cell_ontology_input_path
- uberon_input_path
- biosample_index_output_path

All can/should be MISSING in this case, because there are no hardcoded parameters (e.g. an r2 threshold, as for the LD index). Similarly, target_path needs to be dropped, as this parameter is not part of the BiosampleIndexStep parameters.
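A sketch of what that step config could look like, with a plain-string stand-in for omegaconf's MISSING sentinel (field names are taken from the comment; this is not the actual gentropy config class):

```python
from dataclasses import dataclass

MISSING = "???"  # stand-in for omegaconf.MISSING


@dataclass
class BiosampleIndexConfig:
    """Step config: only the inputs/outputs the step needs, no hardcoded parameters."""

    cell_ontology_input_path: str = MISSING
    uberon_input_path: str = MISSING
    biosample_index_output_path: str = MISSING
    _target_: str = "gentropy.biosample_index.BiosampleIndexStep"
```

Note there is deliberately no target_path field, per the comment above.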
from pyspark.sql.functions import (
    array_distinct,
    coalesce,
    col,
    collect_list,
    collect_set,
    explode_outer,
    first,
    regexp_replace,
    udf,
)
from pyspark.sql.types import ArrayType, StringType
This is just a stylistic comment, but we do not import Spark functions like this. There's a not-too-serious explanation: some of the functions would overwrite Python built-ins, e.g. from pyspark.sql.functions import sum, max. That's why we import all functions under a prefix:

import pyspark.sql.functions as f
# then calling a function looks like this:
f.array_distinct(f.array_union(f.col('colname'), ...))

It's quite marginal, but being stylistically consistent is good.
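The shadowing problem is easy to reproduce in plain Python, without Spark:

```python
total = sum([1, 2, 3])  # builtin sum: 6


# A bare `from ... import sum` would rebind the name, e.g.:
def sum(col):  # deliberately shadows the builtin, as such an import would
    return f"sum({col})"


shadowed = sum("beta")  # no longer the builtin; returns the string "sum(beta)"
```

After the rebinding, any later numeric `sum(...)` call in the module silently breaks, which is why the `f.` prefix convention is safer.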
def merge_biosample_indices(
    biosample_indices: list[BiosampleIndex]
) -> BiosampleIndex:
I wouldn't add this function here. Merging biosample indices is an inherent ability of the biosample indices themselves, e.g.:

a: BiosampleIndex
b: BiosampleIndex
merged: BiosampleIndex = a.merge_indices(b)

assuming the operation is commutative, so that b.merge_indices(a) yields the same index. So my advice is to move this logic to an instance method of the BiosampleIndex dataset.
(also this would imply one fewer function to import, as the merge function travels wherever the biosample object goes)
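A toy sketch of the suggested instance method, with a set of identifiers standing in for the Spark DataFrame the real dataset wraps (names are illustrative):

```python
from dataclasses import dataclass


@dataclass
class BiosampleIndex:
    """Stand-in for the real dataset, which wraps a Spark DataFrame."""

    biosample_ids: frozenset

    def merge_indices(self, other: "BiosampleIndex") -> "BiosampleIndex":
        # Union of identifiers: commutative, so a.merge_indices(b)
        # and b.merge_indices(a) yield the same index.
        return BiosampleIndex(self.biosample_ids | other.biosample_ids)


a = BiosampleIndex(frozenset({"UBERON:0002107"}))
b = BiosampleIndex(frozenset({"CL:0000236"}))
merged = a.merge_indices(b)
```

The PR's merge takes the first value on single-value conflicts, so full commutativity only holds for the set-union parts.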
) -> BiosampleIndex:
    """Merge a list of biosample indices into a single biosample index.

    Where there are conflicts in single values, the first value is taken. In list values, the union of all values is taken.
So this function is not entirely commutative, but I think this is still fine.
        BiosampleIndex: Merged biosample index.
    """

    def merge_lists(
It's a great function with nice, elegant logic! A few comments:

- I would add a description explaining that the order of elements is not kept (lists are ordered structures).
- It returns a unique set of elements instead of all elements.
- I would put this function under common.utils.py, because such a function might be useful in other places.

Question: should this function be generalised to flatten arbitrarily deeply nested lists (via recursion)?
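One possible recursive generalisation, sketched in plain Python (a hypothetical helper, not part of the PR):

```python
def flatten(nested):
    """Recursively flatten arbitrarily deeply nested lists,
    skipping None sub-lists."""
    out = []
    for item in nested:
        if item is None:
            continue
        if isinstance(item, list):
            out.extend(flatten(item))
        else:
            out.append(item)
    return out


# De-duplicating afterwards mirrors the PR function's set behaviour:
unique = set(flatten([["a", ["b", None]], None, ["b", "c"]]))  # {"a", "b", "c"}
```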
        return list({item for sublist in lists if sublist is not None for item in sublist})

    # Make a Spark UDF (user-defined function) to merge lists
    merge_lists_udf = udf(merge_lists, ArrayType(StringType()))
Oh, if you want to flatten an array-of-array column in PySpark, you shouldn't use UDFs! (In general, UDFs should be avoided where possible, because the interface between the Python and Java layers makes execution very inefficient and poorly scalable.) It works like this:

f.flatten(f.col('array_of_array_column'))
If you need a unique set of elements from the flattened array, you can use:

f.array_distinct(f.flatten(f.col('array_of_array_column')))
@@ -22,6 +23,7 @@ def __init__(
        study_index_path: list[str],
        target_index_path: str,
        disease_index_path: str,
        biosample_index_path: str,
As a new parameter is added to the validation step, it also needs to be added to the config file here.
def validate_biosample(self: StudyIndex, biosample_index: BiosampleIndex) -> StudyIndex:
    """Validating biosample identifiers in the study index against the provided biosample index.
This is not a critique, rather a comment: we are doing exactly the same steps for genes (disease validation is more complicated). I'm wondering if we can have one function that would be used by both the biosample and gene validation, e.g. we would pass the following parameters to this function:

- the list of fields you need to find in
- this column, and if you don't find them, add
- this flag...

Not exactly sure how to implement this, but it sounds abstract enough that it could be used in other places and for other datasets.
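A rough shape of such a generic validator, sketched over plain dict rows (the function, parameter, and flag names are made up for illustration; the real version would be a DataFrame transformation):

```python
def flag_missing_ids(rows, id_column, valid_ids, flag):
    """For each row, check the identifiers in `id_column` against
    `valid_ids`; if any is missing, append `flag` to the row's
    qualityControls list. Reusable for biosamples, genes, etc."""
    for row in rows:
        ids = row.get(id_column) or []
        if any(i not in valid_ids for i in ids):
            row.setdefault("qualityControls", []).append(flag)
    return rows
```

The parameterisation (column to check, reference set, flag to add) is the part that generalises across datasets.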
Co-authored-by: Daniel Suveges <[email protected]>
…ts/gentropy into alegbe-biosample_index
✨ Context
🛠 What does this PR implement
🙈 Missing
🚦 Before submitting
- Do you make your PR against the dev branch?
- Did the tests pass (make test)?
- Did the pre-commit checks pass (poetry run pre-commit run --all-files)?