Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add biosample index #769

Open
wants to merge 28 commits into
base: dev
Choose a base branch
from
Open

feat: add biosample index #769

wants to merge 28 commits into from

Conversation

Tobi1kenobi
Copy link

@Tobi1kenobi Tobi1kenobi commented Sep 18, 2024

✨ Context

🛠 What does this PR implement

🙈 Missing

🚦 Before submitting

  • Do these changes cover one single feature (one change at a time)?
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes?
  • Did you make sure there is no commented out code in this PR?
  • Did you follow conventional commits standards in PR title and commit messages?
  • Did you make sure the branch is up-to-date with the dev branch?
  • Did you write any new necessary tests?
  • Did you make sure the changes pass local tests (make test)?
  • Did you make sure the changes pass pre-commit rules (e.g poetry run pre-commit run --all-files)?

@Tobi1kenobi Tobi1kenobi marked this pull request as ready for review September 18, 2024 15:39
@Tobi1kenobi
Copy link
Author

Tobi1kenobi commented Sep 18, 2024

Few notes:

  • When joining onto eQTL catalogue three biosampleIds have no match in the biosample index: BTO_0000930 (neuroblast), EFO_0005292 (lymphoblastoid cell line), EFO_0004905 (induced pluripotent stem cell). When thinking about the other use cases for this index some thought should go into the best way to capture these biosamples i.e. what is the best ontology for cell lines.
  • Current no column for deprecated or not. Not super relevant for current purposes of left-joining and wasn't entirely clear how to implement (drop deprecated, flag them, etc)
  • dbXrefs currently a flat list of strings rather than split into ID and as done for variant index. This was partially for simplicity, partially because I was unsure how best to split (e.g. CL_1001593 -> [CL_1001593, CL], [CL_1001593, Cell Ontology], [1001593, CL], [CL_1001593, Uberon (ontology actually sourced from)] and partially because I didn't see an immediate use case unlike variant index.

Comment on lines 23 to 31
"name": "dbXrefs",
"type": {
"type": "array",
"elementType": "string",
"containsNull": true
},
"nullable": true,
"metadata": {}
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not really good. We already have this column in other datasets eg. VariantIndex + others in the platform. The expectation from this filed is that it is a list of structs, where each element not only informs about the identifier, but also about the source.:

    {
      "metadata": {},
      "name": "dbXrefs",
      "nullable": true,
      "type": {
        "containsNull": true,
        "elementType": {
          "fields": [
            {
              "metadata": {},
              "name": "id",
              "nullable": true,
              "type": "string"
            },
            {
              "metadata": {},
              "name": "source",
              "nullable": true,
              "type": "string"
            }
          ],
          "type": "struct"
        },
        "type": "array"
      }
    }

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, sorta... See my comment on the pull request. It wasn't immediately apparent how to structure it

@@ -26,7 +26,7 @@ This section contains information about the data source harmonisation tools avai
2. GWAS catalog's [harmonisation pipeline](https://www.ebi.ac.uk/gwas/docs/methods/summary-statistics#_harmonised_summary_statistics_data)
3. Ensembl's [Variant Effect Predictor](https://www.ensembl.org/info/docs/tools/vep/index.html)

## Linkage desiquilibrium
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch!

Comment on lines 58 to 60
target_path: str = MISSING
biosample_index_path: str = MISSING
_target_: str = "gentropy.biosample_index.BiosampleIndexStep"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The configuration should have all the inputs that the step needs.

  • cell_ontology_input_path
  • uberon_input_path
  • biosample_index_output_path

All can/should be missing in this case, because there's no hardcoded parameters eg. r2 threshold for LD index. Similarly, target_path needs to be dropped as this parameter is not part of the BiosampleIndexStep parameters.

Comment on lines 5 to 16
from pyspark.sql.functions import (
array_distinct,
coalesce,
col,
collect_list,
collect_set,
explode_outer,
first,
regexp_replace,
udf,
)
from pyspark.sql.types import ArrayType, StringType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just stylistic comment, but we do not import spark functions like this. There's a not too serious explanation for this: some of the functions might overwrite Python base function eg. from pyspark.sql.functions import sum, max, that's why we import all functions with a prefix:

import pyspark.sql.functions as f

# then calling a function would look like this:
f.array_distinct(f.array_union(f.col('colname')....

It's quite marginal, but being stylistically consistent is good.

Comment on lines 145 to 147
def merge_biosample_indices(
biosample_indices : list[BiosampleIndex]
) -> BiosampleIndex:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't add this function here. The merge of biosample indices is kind of an inherent ability of biosample indices as well. Eg.

a: BiosampleIndex
b: BiosampleIndex
merged: BiosampleIndex = a.merge_indices(b)

Assuming the operation is commutative, and b.merge(a) yields the same index.

So my advise is to move this logic as a instance method of the BiosampleIndex dataset.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also this would imply one fewer function to import, as the merge function travels wherever the biosample object goes)

) -> BiosampleIndex:
"""Merge a list of biosample indices into a single biosample index.

Where there are conflicts, in single values - the first value is taken. In list values, the union of all values is taken.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this function is not entirely commutative, but I think this is still fine.

BiosampleIndex: Merged biosample index.
"""

def merge_lists(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a great function, nice elegant logic! A few comments:

  • I would add a description explaining that the order of element are not kept (lists are ordered structures).
  • It returns with a unique set of elements, instead of all elements.
  • I would put this function under common.utils.py, because such function might be useful in other places.

Question: Should this function be generalised to flatten arbitrarily deeply nested lists? (via recursion?)

return list({item for sublist in lists if sublist is not None for item in sublist})

# Make a spark udf (user defined function) to merge lists
merge_lists_udf = udf(merge_lists, ArrayType(StringType()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, if you want to flatten an array of array column in pyspark, you shouldn't use udfs! (in general udfs should be avoided if possible, because the interface between python and java layer makes the execution very inefficient and poorly scalable)

It works like this:

f.flatten(f.col('array_of_array_column"))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you need a unique set of elements from the flattened array you can use:

f.array_distinct(f.flatten(f.col('array_of_array_column")))

@@ -22,6 +23,7 @@ def __init__(
study_index_path: list[str],
target_index_path: str,
disease_index_path: str,
biosample_index_path: str,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As new parameter is added to the validation step, this new parameter needs to be added to the config file here.

Comment on lines +415 to +416
def validate_biosample(self: StudyIndex, biosample_index: BiosampleIndex) -> StudyIndex:
"""Validating biosample identifiers in the study index against the provided biosample index.
Copy link
Contributor

@DSuveges DSuveges Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a critique, rather a comment: we are doing exactly the same steps for genes (disease validation is more complicated). I'm wondering if we can have one function that would be used by both the biosample and gene validation... eg. We would pass the following parameters to this function:

  • this is the list of fields you need to find in
  • this column, and if you don't find, add
  • this flag...

Not exactly sure how to implement this, but sounds quite abstract that could be used in other places and other datasets maybe.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create biosample index to resolve Tissue/cell type entities
2 participants