feat: add biosample index #769
base: dev
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
--- | ||
title: Biosample index | ||
--- | ||
|
||
::: gentropy.dataset.biosample_index.BiosampleIndex | ||
|
||
## Schema | ||
|
||
--8<-- "assets/schemas/biosample_index.md" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
--- | ||
title: Cell Ontology | ||
--- | ||
|
||
The [Cell Ontology](http://www.obofoundry.org/ontology/cl.html) is a structured controlled vocabulary for cell types. It is used to annotate cell types in single-cell RNA-seq data and other omics data. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
--- | ||
title: Uberon | ||
--- | ||
|
||
[Uberon](http://uberon.github.io/) is a multi-species anatomy ontology that brings anatomical structures from different species together in a single ontology. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
--- | ||
title: biosample_index | ||
--- | ||
|
||
::: gentropy.biosample_index.BiosampleIndexStep |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,83 @@ | ||
{ | ||
"type": "struct", | ||
"fields": [ | ||
{ | ||
"name": "biosampleId", | ||
"type": "string", | ||
"nullable": false, | ||
"metadata": {} | ||
}, | ||
{ | ||
"name": "biosampleName", | ||
"type": "string", | ||
"nullable": true, | ||
"metadata": {} | ||
}, | ||
{ | ||
"name": "description", | ||
"type": "string", | ||
"nullable": true, | ||
"metadata": {} | ||
}, | ||
{ | ||
"name": "xrefs", | ||
"type": { | ||
"type": "array", | ||
"elementType": "string", | ||
"containsNull": true | ||
}, | ||
"nullable": true, | ||
"metadata": {} | ||
}, | ||
{ | ||
"name": "synonyms", | ||
"type": { | ||
"type": "array", | ||
"elementType": "string", | ||
"containsNull": true | ||
}, | ||
"nullable": true, | ||
"metadata": {} | ||
}, | ||
{ | ||
"name": "parents", | ||
"type": { | ||
"type": "array", | ||
"elementType": "string", | ||
"containsNull": true | ||
}, | ||
"nullable": true, | ||
"metadata": {} | ||
}, | ||
{ | ||
"name": "ancestors", | ||
"type": { | ||
"type": "array", | ||
"elementType": "string", | ||
"containsNull": true | ||
}, | ||
"nullable": true, | ||
"metadata": {} | ||
}, | ||
{ | ||
"name": "descendants", | ||
"type": { | ||
"type": "array", | ||
"elementType": "string", | ||
"containsNull": true | ||
}, | ||
"nullable": true, | ||
"metadata": {} | ||
}, | ||
{ | ||
"name": "children", | ||
"type": { | ||
"type": "array", | ||
"elementType": "string", | ||
"containsNull": true | ||
}, | ||
"nullable": true, | ||
"metadata": {} | ||
} | ||
] | ||
} |
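As a rough illustration of what this schema implies for a single row, the sketch below checks a plain Python dict against a trimmed copy of the schema: `biosampleId` is the only non-nullable field, and the list-valued fields are arrays of strings that may contain nulls. The helper name `check_row`, the trimmed schema, and the sample rows are assumptions made for this example; they are not part of gentropy, which loads the full JSON through `parse_spark_schema`.

```python
from __future__ import annotations

import json

# Trimmed copy of the schema above (assumption: only three fields kept for brevity).
SCHEMA_JSON = """
{
  "type": "struct",
  "fields": [
    {"name": "biosampleId", "type": "string", "nullable": false, "metadata": {}},
    {"name": "biosampleName", "type": "string", "nullable": true, "metadata": {}},
    {"name": "synonyms",
     "type": {"type": "array", "elementType": "string", "containsNull": true},
     "nullable": true, "metadata": {}}
  ]
}
"""


def check_row(row: dict, schema: dict) -> list[str]:
    """Hypothetical helper: return the schema violations found in one row."""
    errors = []
    for field in schema["fields"]:
        name = field["name"]
        value = row.get(name)
        if value is None:
            # A missing or null value is only an error for non-nullable fields.
            if not field["nullable"]:
                errors.append(f"{name}: null in non-nullable field")
            continue
        if field["type"] == "string":
            if not isinstance(value, str):
                errors.append(f"{name}: expected string")
        elif isinstance(field["type"], dict) and field["type"]["type"] == "array":
            allow_null = field["type"]["containsNull"]
            # Every element must be a string, or null when containsNull is true.
            if not isinstance(value, list) or not all(
                isinstance(v, str) or (v is None and allow_null) for v in value
            ):
                errors.append(f"{name}: expected array of strings")
    return errors


schema = json.loads(SCHEMA_JSON)
# Sample rows are illustrative only.
ok = check_row({"biosampleId": "CL_0000000", "synonyms": ["cell", None]}, schema)
bad = check_row({"biosampleName": "cell", "synonyms": "cell"}, schema)
```

Here `ok` is an empty list, while `bad` reports the missing `biosampleId` and the non-list `synonyms` value.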
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
"""Step to generate biosample index dataset.""" | ||
from __future__ import annotations | ||
|
||
from gentropy.common.session import Session | ||
from gentropy.datasource.biosample_ontologies.utils import extract_ontology_from_json | ||
|
||
|
||
class BiosampleIndexStep: | ||
"""Biosample index step. | ||
|
||
This step generates a Biosample index dataset from the various ontology sources. Currently Cell Ontology and Uberon are supported. | ||
""" | ||
|
||
def __init__( | ||
self, | ||
session: Session, | ||
cell_ontology_input_path: str, | ||
uberon_input_path: str, | ||
biosample_index_path: str, | ||
) -> None: | ||
"""Run Biosample index generation step. | ||
|
||
Args: | ||
session (Session): Session object. | ||
cell_ontology_input_path (str): Input cell ontology dataset path. | ||
uberon_input_path (str): Input uberon dataset path. | ||
biosample_index_path (str): Output biosample index dataset path. | ||
""" | ||
cell_ontology_index = extract_ontology_from_json(cell_ontology_input_path, session.spark) | ||
uberon_index = extract_ontology_from_json(uberon_input_path, session.spark) | ||
|
||
biosample_index = cell_ontology_index.merge_indices([uberon_index]) | ||
|
||
biosample_index.df.write.mode(session.write_mode).parquet(biosample_index_path) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,73 @@ | ||
"""Biosample index dataset.""" | ||
|
||
from __future__ import annotations | ||
|
||
from dataclasses import dataclass | ||
from functools import reduce | ||
from typing import TYPE_CHECKING | ||
|
||
import pyspark.sql.functions as f | ||
from pyspark.sql import DataFrame | ||
from pyspark.sql.types import ArrayType, StringType | ||
|
||
from gentropy.common.schemas import parse_spark_schema | ||
from gentropy.dataset.dataset import Dataset | ||
|
||
if TYPE_CHECKING: | ||
from pyspark.sql.types import StructType | ||
|
||
|
||
@dataclass | ||
class BiosampleIndex(Dataset): | ||
"""Biosample index dataset. | ||
|
||
A Biosample index dataset captures the metadata of biosamples (e.g. tissues, cell types, cell lines), such as alternate names and relationships with other biosamples. | ||
""" | ||
|
||
@classmethod | ||
def get_schema(cls: type[BiosampleIndex]) -> StructType: | ||
"""Provide the schema for the BiosampleIndex dataset. | ||
|
||
Returns: | ||
StructType: The schema of the BiosampleIndex dataset. | ||
""" | ||
return parse_spark_schema("biosample_index.json") | ||
|
||
def merge_indices( | ||
self: BiosampleIndex, | ||
biosample_indices: list[BiosampleIndex], | ||
) -> BiosampleIndex: | ||
"""Merge this biosample index with a list of other biosample indices. | ||
|
||
Where values conflict, the first non-null value is taken for single-valued columns; for list-valued columns, the union of all values is taken. | ||
|
||
Args: | ||
biosample_indices (list[BiosampleIndex]): Biosample indices to merge with this one. | ||
|
||
Returns: | ||
BiosampleIndex: Merged biosample index. | ||
""" | ||
# Extract the DataFrames from the BiosampleIndex objects, including this one | ||
biosample_dfs = [biosample_index.df for biosample_index in biosample_indices] + [self.df] | ||
|
||
# Merge the DataFrames | ||
merged_df = reduce(DataFrame.unionAll, biosample_dfs) | ||
|
||
# Determine aggregation functions for each column | ||
# Currently this will take the first value for single values and merge lists for list values | ||
agg_funcs = [] | ||
for field in merged_df.schema.fields: | ||
if field.name != "biosampleId": # Skip the grouping column | ||
if field.dataType == ArrayType(StringType()): | ||
agg_funcs.append(f.array_distinct(f.flatten(f.collect_list(f.col(field.name)))).alias(field.name)) | ||
else: | ||
agg_funcs.append(f.first(f.col(field.name), ignorenulls=True).alias(field.name)) | ||
|
||
# Perform aggregation | ||
aggregated_df = merged_df.groupBy("biosampleId").agg(*agg_funcs) | ||
|
||
return BiosampleIndex( | ||
_df=aggregated_df, | ||
_schema=BiosampleIndex.get_schema() | ||
) |
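The merge rule described in the docstring (one non-null value for single-valued columns, the union of values for list-valued columns) can be sketched without Spark. The function below is a plain-Python illustration of that rule on dictionaries grouped by `biosampleId`; `merge_rows` and the sample records are hypothetical and are not part of gentropy.

```python
from __future__ import annotations


def merge_rows(rows: list[dict]) -> list[dict]:
    """Hypothetical helper: merge rows that share a biosampleId."""
    merged: dict[str, dict] = {}
    for row in rows:
        key = row["biosampleId"]
        target = merged.setdefault(key, {"biosampleId": key})
        for name, value in row.items():
            if name == "biosampleId" or value is None:
                continue  # skip the grouping column and null values
            if isinstance(value, list):
                # List-valued column: keep the union, preserving first-seen order.
                existing = target.setdefault(name, [])
                for item in value:
                    if item not in existing:
                        existing.append(item)
            elif target.get(name) is None:
                # Single-valued column: keep the first non-null value seen.
                target[name] = value
    return list(merged.values())


# Sample records are illustrative only.
rows = [
    {"biosampleId": "UBERON_0002107", "biosampleName": "liver", "synonyms": ["hepar"]},
    {"biosampleId": "UBERON_0002107", "biosampleName": None, "synonyms": ["hepar", "iecur"]},
    {"biosampleId": "CL_0000182", "biosampleName": "hepatocyte", "synonyms": None},
]
merged = merge_rows(rows)
```

One difference from the Spark version is worth keeping in mind: `first` inside a `groupBy` depends on row order, which Spark does not guarantee after a shuffle, so "first" there is closer to "any non-null value" than to the strict input order used in this sketch.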
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -19,6 +19,7 @@ | |
from pyspark.sql import Column, DataFrame | ||
from pyspark.sql.types import StructType | ||
|
||
from gentropy.dataset.biosample_index import BiosampleIndex | ||
from gentropy.dataset.gene_index import GeneIndex | ||
|
||
|
||
|
@@ -29,13 +30,15 @@ class StudyQualityCheck(Enum): | |
UNRESOLVED_TARGET (str): Target/gene identifier could not match to reference - Labelling failing target. | ||
UNRESOLVED_DISEASE (str): Disease identifier could not match to reference or retired identifier - labelling failing disease | ||
UNKNOWN_STUDY_TYPE (str): Indicating the provided type of study is not supported. | ||
UNKNOWN_BIOSAMPLE (str): Flagging if a biosample identifier is not found in the reference. | ||
DUPLICATED_STUDY (str): Flagging if a study identifier is not unique. | ||
NO_GENE_PROVIDED (str): Flagging QTL studies if the measured | ||
""" | ||
|
||
UNRESOLVED_TARGET = "Target/gene identifier could not match to reference." | ||
UNRESOLVED_DISEASE = "No valid disease identifier found." | ||
UNKNOWN_STUDY_TYPE = "This type of study is not supported." | ||
UNKNOWN_BIOSAMPLE = "Biosample identifier was not found in the reference." | ||
DUPLICATED_STUDY = "The identifier of this study is not unique." | ||
NO_GENE_PROVIDED = "QTL study doesn't have gene assigned." | ||
|
||
|
@@ -408,3 +411,36 @@ def validate_target(self: StudyIndex, target_index: GeneIndex) -> StudyIndex: | |
) | ||
|
||
return StudyIndex(_df=validated_df, _schema=StudyIndex.get_schema()) | ||
|
||
def validate_biosample(self: StudyIndex, biosample_index: BiosampleIndex) -> StudyIndex: | ||
"""Validate biosample identifiers in the study index against the provided biosample index. | ||
Comment on lines +415 to +416
This is not a critique, rather a comment: we are doing exactly the same steps for genes (disease validation is more complicated). I'm wondering if we can have one function that would be used by both the biosample and gene validation, e.g. we would pass the following parameters to this function:
Not exactly sure how to implement this, but it sounds abstract enough that it could be used in other places and maybe other datasets. |
||
|
||
Args: | ||
biosample_index (BiosampleIndex): Biosample index containing a reference of biosample identifiers e.g. cell types, tissues, cell lines, etc. | ||
|
||
Returns: | ||
StudyIndex: Study index with studies flagged if the biosample identifier could not be validated. | ||
""" | ||
biosample_set = biosample_index.df.select("biosampleId", f.lit(True).alias("isIdFound")) | ||
|
||
validated_df = ( | ||
self.df.join(biosample_set, self.df.biosampleFromSourceId == biosample_set.biosampleId, how="left") | ||
.withColumn( | ||
"isIdFound", | ||
f.when( | ||
f.col("isIdFound").isNull(), | ||
f.lit(False), | ||
).otherwise(f.lit(True)), | ||
) | ||
.withColumn( | ||
"qualityControls", | ||
StudyIndex.update_quality_flag( | ||
f.col("qualityControls"), | ||
~f.col("isIdFound"), | ||
StudyQualityCheck.UNKNOWN_BIOSAMPLE, | ||
), | ||
) | ||
.drop("isIdFound").drop("biosampleId") | ||
) | ||
|
||
return StudyIndex(_df=validated_df, _schema=StudyIndex.get_schema()) |
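The reviewer's idea of a single validation routine shared by the gene and biosample checks could look roughly like the sketch below. It is written in plain Python over dictionaries rather than Spark DataFrames, so it only illustrates the shape of the abstraction. The function name `flag_unmatched`, its parameters, and the sample studies are assumptions, not gentropy API; the parameter list the reviewer had in mind is not given in the comment, so the one used here is a guess.

```python
from __future__ import annotations


def flag_unmatched(
    records: list[dict],
    id_field: str,
    reference_ids: set[str],
    flag: str,
    flags_field: str = "qualityControls",
) -> list[dict]:
    """Hypothetical helper: add `flag` to records whose id is not in the reference."""
    flagged = []
    for record in records:
        updated = dict(record)  # shallow copy so the input record is not mutated
        flags = list(updated.get(flags_field) or [])
        # A missing (None) identifier is treated as unmatched, like a failed left join.
        if updated.get(id_field) not in reference_ids and flag not in flags:
            flags.append(flag)
        updated[flags_field] = flags
        flagged.append(updated)
    return flagged


UNKNOWN_BIOSAMPLE = "Biosample identifier was not found in the reference."

# Sample studies are illustrative only.
studies = [
    {"studyId": "S1", "biosampleFromSourceId": "UBERON_0002107", "qualityControls": []},
    {"studyId": "S2", "biosampleFromSourceId": "UNKNOWN_1", "qualityControls": None},
]
validated = flag_unmatched(
    studies, "biosampleFromSourceId", {"UBERON_0002107"}, UNKNOWN_BIOSAMPLE
)
```

The same call with a different `id_field`, reference set, and flag would cover the gene check, which is the reuse the review comment is pointing at.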
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
"""Biosample index data source.""" | ||
|
||
from __future__ import annotations |
Great catch!