feat: add biosample index #769

Open
wants to merge 28 commits into
base: dev
6fabc98
Initial commit of biosample index
Tobi1kenobi Sep 9, 2024
e8a3775
Make minimal class
Tobi1kenobi Sep 9, 2024
c4d6d5f
Tidy up first draft of adding biosample index
Tobi1kenobi Sep 9, 2024
186e773
Add beginning of logic for checking if biosample from a studyindex is…
Tobi1kenobi Sep 10, 2024
6f0a2e2
Make early file for merging multiple biosample indices into one
Tobi1kenobi Sep 10, 2024
01917a7
Merge branch 'dev' into alegbe-biosample_index
Tobi1kenobi Sep 10, 2024
55e2baf
Finish adding basic iteration of biosample index, needs debugging
Tobi1kenobi Sep 10, 2024
692732b
Tweak slightly
Tobi1kenobi Sep 13, 2024
30dc23f
Modified the parser to accept JSON files
Tobi1kenobi Sep 13, 2024
28e1f92
Update biosample index
Tobi1kenobi Sep 16, 2024
33ebf58
Tests and docs
Tobi1kenobi Sep 16, 2024
26a4295
Updating tests
Tobi1kenobi Sep 16, 2024
1c507e6
Revert GWAS catalog file
Tobi1kenobi Sep 16, 2024
567d8e1
fix(biosample index): update to match pre-commit standards
Tobi1kenobi Sep 17, 2024
12293d3
fix(biosample index): merging indices fix
Tobi1kenobi Sep 17, 2024
850f910
fix(biosample index): update study index qc logic
Tobi1kenobi Sep 17, 2024
c42bdd6
fix(biosample index): fix missing mock_biosample_index
Tobi1kenobi Sep 18, 2024
07daedc
chore(biosample index): change datasource name from ontologies
Tobi1kenobi Sep 18, 2024
fb98e15
chore(biosample index): merge local
Tobi1kenobi Sep 18, 2024
b150122
fix(biosample index): add dataset doc
Tobi1kenobi Sep 18, 2024
4fb8d05
Merge branch 'dev' into alegbe-biosample_index
DSuveges Sep 18, 2024
978f636
fix(biosample index): change dbXrefs to xrefs
Tobi1kenobi Sep 19, 2024
2364f8d
Merge branch 'alegbe-biosample_index' of https://github.com/opentarge…
Tobi1kenobi Sep 19, 2024
6f99147
Merge branch 'dev' into alegbe-biosample_index
DSuveges Sep 20, 2024
ec4edf3
chore (biosample index): better commenting
Tobi1kenobi Sep 20, 2024
cf00504
fix(biosample index): various minor tweaks to biosample index
Tobi1kenobi Sep 21, 2024
ad947cb
Merge branch 'alegbe-biosample_index' of https://github.com/opentarge…
Tobi1kenobi Sep 21, 2024
729f492
fix(biosample index): minor bug
Tobi1kenobi Sep 21, 2024
9 changes: 9 additions & 0 deletions docs/python_api/datasets/biosample_index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
---
title: Biosample index
---

::: gentropy.dataset.biosample_index.BiosampleIndex

## Schema

--8<-- "assets/schemas/biosample_index.md"
7 changes: 6 additions & 1 deletion docs/python_api/datasources/_datasources.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,7 @@ This section contains information about the data source harmonisation tools avai
2. GWAS catalog's [harmonisation pipeline](https://www.ebi.ac.uk/gwas/docs/methods/summary-statistics#_harmonised_summary_statistics_data)
3. Ensembl's [Variant Effect Predictor](https://www.ensembl.org/info/docs/tools/vep/index.html)

## Linkage desiquilibrium

Great catch!

## Linkage disequilibrium

1. [GnomAD](gnomad/_gnomad.md) v2.1.1 LD matrixes (7 ancestries)

Expand All @@ -37,3 +37,8 @@ This section contains information about the data source harmonisation tools avai
## Gene annotation

1. [Open Targets Platform Target Dataset](open_targets/target.md) (derived from Ensembl)

## Biological samples

1. [Uberon](biosample_ontologies/_uberon.md)
2. [Cell Ontology](biosample_ontologies/_cell_ontology.md)
5 changes: 5 additions & 0 deletions docs/python_api/datasources/biosample_ontologies/_cell_ontology.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
---
title: Cell Ontology
---

The [Cell Ontology](http://www.obofoundry.org/ontology/cl.html) is a structured controlled vocabulary for cell types. It is used to annotate cell types in single-cell RNA-seq data and other omics data.
5 changes: 5 additions & 0 deletions docs/python_api/datasources/biosample_ontologies/_uberon.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
---
title: Uberon
---

[Uberon](http://uberon.github.io/) is a multi-species anatomy ontology that integrates species-specific anatomy ontologies into a single cross-species ontology.
5 changes: 5 additions & 0 deletions docs/python_api/steps/biosample_index_step.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
---
title: biosample_index
---

::: gentropy.biosample_index.BiosampleIndexStep
3 changes: 2 additions & 1 deletion poetry.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

83 changes: 83 additions & 0 deletions src/gentropy/assets/schemas/biosample_index.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,83 @@
{
"type": "struct",
"fields": [
{
"name": "biosampleId",
"type": "string",
"nullable": false,
"metadata": {}
},
{
"name": "biosampleName",
"type": "string",
"nullable": true,
"metadata": {}
},
{
"name": "description",
"type": "string",
"nullable": true,
"metadata": {}
},
{
"name": "xrefs",
"type": {
"type": "array",
"elementType": "string",
"containsNull": true
},
"nullable": true,
"metadata": {}
},
{
"name": "synonyms",
"type": {
"type": "array",
"elementType": "string",
"containsNull": true
},
"nullable": true,
"metadata": {}
},
{
"name": "parents",
"type": {
"type": "array",
"elementType": "string",
"containsNull": true
},
"nullable": true,
"metadata": {}
},
{
"name": "ancestors",
"type": {
"type": "array",
"elementType": "string",
"containsNull": true
},
"nullable": true,
"metadata": {}
},
{
"name": "descendants",
"type": {
"type": "array",
"elementType": "string",
"containsNull": true
},
"nullable": true,
"metadata": {}
},
{
"name": "children",
"type": {
"type": "array",
"elementType": "string",
"containsNull": true
},
"nullable": true,
"metadata": {}
}
]
}
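As a reading aid for the schema above, here is a minimal pure-Python sketch of the record shape it describes: `biosampleId` is the only non-nullable field, `biosampleName` and `description` are nullable strings, and the remaining fields are nullable arrays of strings. The `check_record` helper and the example values are hypothetical illustrations, not part of gentropy, which would rely on Spark to enforce the schema.

```python
from __future__ import annotations

# Field groups copied from the schema above.
STRING_FIELDS = ("biosampleName", "description")
ARRAY_FIELDS = ("xrefs", "synonyms", "parents", "ancestors", "descendants", "children")


def check_record(record: dict) -> list[str]:
    """Return a list of schema violations for one biosample record (hypothetical helper)."""
    problems = []
    # biosampleId is the only field declared with "nullable": false.
    if not isinstance(record.get("biosampleId"), str):
        problems.append("biosampleId must be a non-null string")
    for name in STRING_FIELDS:
        value = record.get(name)
        if value is not None and not isinstance(value, str):
            problems.append(f"{name} must be a string or null")
    for name in ARRAY_FIELDS:
        value = record.get(name)
        if value is None:
            continue
        # "containsNull": true means individual elements may be null.
        if not isinstance(value, list) or not all(v is None or isinstance(v, str) for v in value):
            problems.append(f"{name} must be an array of strings or null")
    return problems


# Illustrative record; the identifier and labels are examples only.
example = {
    "biosampleId": "CL_0000000",
    "biosampleName": "cell",
    "synonyms": ["example synonym"],
    "parents": None,
}
```

Calling `check_record(example)` returns an empty list, while a record without `biosampleId` reports a single violation.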
34 changes: 34 additions & 0 deletions src/gentropy/biosample_index.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
"""Step to generate biosample index dataset."""
from __future__ import annotations

from gentropy.common.session import Session
from gentropy.datasource.biosample_ontologies.utils import extract_ontology_from_json


class BiosampleIndexStep:
    """Biosample index step.

    This step generates a Biosample index dataset from the various ontology sources. Currently Cell Ontology and Uberon are supported.
    """

    def __init__(
        self,
        session: Session,
        cell_ontology_input_path: str,
        uberon_input_path: str,
        biosample_index_path: str,
    ) -> None:
        """Run Biosample index generation step.

        Args:
            session (Session): Session object.
            cell_ontology_input_path (str): Input cell ontology dataset path.
            uberon_input_path (str): Input uberon dataset path.
            biosample_index_path (str): Output biosample index dataset path.
        """
        cell_ontology_index = extract_ontology_from_json(cell_ontology_input_path, session.spark)
        uberon_index = extract_ontology_from_json(uberon_input_path, session.spark)

        biosample_index = cell_ontology_index.merge_indices([uberon_index])

        biosample_index.df.write.mode(session.write_mode).parquet(biosample_index_path)
12 changes: 12 additions & 0 deletions src/gentropy/config.py
Original file line number Diff line number Diff line change
Expand Up @@ -51,6 +51,16 @@ class GeneIndexConfig(StepConfig):
    _target_: str = "gentropy.gene_index.GeneIndexStep"


@dataclass
class BiosampleIndexConfig(StepConfig):
    """Biosample index step configuration."""

    cell_ontology_input_path: str = MISSING
    uberon_input_path: str = MISSING
    biosample_index_path: str = MISSING
    _target_: str = "gentropy.biosample_index.BiosampleIndexStep"


@dataclass
class GWASCatalogStudyCurationConfig(StepConfig):
    """GWAS Catalog study curation step configuration."""
Expand Down Expand Up @@ -505,6 +515,7 @@ class StudyValidationStepConfig(StepConfig):
    study_index_path: list[str] = MISSING
    target_index_path: str = MISSING
    disease_index_path: str = MISSING
    biosample_index_path: str = MISSING
    valid_study_index_path: str = MISSING
    invalid_study_index_path: str = MISSING
    invalid_qc_reasons: list[str] = MISSING
Expand Down Expand Up @@ -545,6 +556,7 @@ def register_config() -> None:
    cs.store(group="step", name="colocalisation", node=ColocalisationConfig)
    cs.store(group="step", name="eqtl_catalogue", node=EqtlCatalogueConfig)
    cs.store(group="step", name="gene_index", node=GeneIndexConfig)
    cs.store(group="step", name="biosample_index", node=BiosampleIndexConfig)
    cs.store(
        group="step",
        name="gwas_catalog_study_curation",
Expand Down
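The `_target_` field in `BiosampleIndexConfig` is a dotted path to the class that a Hydra-style configuration framework instantiates, passing the remaining fields as keyword arguments. The sketch below illustrates that idea with the standard library only. `instantiate_target` is a hypothetical stand-in, not the framework's actual implementation, and it uses `fractions.Fraction` as the target so that it runs without gentropy installed.

```python
from importlib import import_module


def instantiate_target(config: dict):
    """Import the class named by `_target_` and call it with the other keys (hypothetical helper)."""
    kwargs = {key: value for key, value in config.items() if key != "_target_"}
    module_path, _, class_name = config["_target_"].rpartition(".")
    target = getattr(import_module(module_path), class_name)
    return target(**kwargs)


# Stand-in config: a real one would point at "gentropy.biosample_index.BiosampleIndexStep"
# and carry the session and the three path arguments.
half = instantiate_target({"_target_": "fractions.Fraction", "numerator": 1, "denominator": 2})
```

Under this reading, registering the config with `cs.store(group="step", name="biosample_index", ...)` is what makes the step selectable by name, and the `MISSING` defaults mark the paths that must be supplied before the step can run.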
73 changes: 73 additions & 0 deletions src/gentropy/dataset/biosample_index.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,73 @@
"""Biosample index dataset."""

from __future__ import annotations

from dataclasses import dataclass
from functools import reduce
from typing import TYPE_CHECKING

import pyspark.sql.functions as f
from pyspark.sql import DataFrame
from pyspark.sql.types import ArrayType, StringType

from gentropy.common.schemas import parse_spark_schema
from gentropy.dataset.dataset import Dataset

if TYPE_CHECKING:
    from pyspark.sql.types import StructType


@dataclass
class BiosampleIndex(Dataset):
    """Biosample index dataset.

    A Biosample index dataset captures the metadata of the biosamples (e.g. tissues, cell types, cell lines, etc) such as alternate names and relationships with other biosamples.
    """

    @classmethod
    def get_schema(cls: type[BiosampleIndex]) -> StructType:
        """Provide the schema for the BiosampleIndex dataset.

        Returns:
            StructType: The schema of the BiosampleIndex dataset.
        """
        return parse_spark_schema("biosample_index.json")

    def merge_indices(
        self: BiosampleIndex,
        biosample_indices: list[BiosampleIndex],
    ) -> BiosampleIndex:
        """Merge a list of biosample indices into a single biosample index.

        Where records conflict, the first non-null value is kept for single-valued columns and the union of all values is kept for list columns.

        Args:
            biosample_indices (list[BiosampleIndex]): Biosample indices to merge.

        Returns:
            BiosampleIndex: Merged biosample index.
        """
        # Extract the DataFrames from the BiosampleIndex objects, including this one
        biosample_dfs = [biosample_index.df for biosample_index in biosample_indices] + [self.df]

        # Merge the DataFrames
        merged_df = reduce(DataFrame.unionAll, biosample_dfs)

        # Determine aggregation functions for each column
        # Currently this will take the first value for single values and merge lists for list values
        agg_funcs = []
        for field in merged_df.schema.fields:
            if field.name != "biosampleId":  # Skip the grouping column
                if field.dataType == ArrayType(StringType()):
                    # Collect the arrays of each group before flattening them into one distinct array
                    agg_funcs.append(f.array_distinct(f.flatten(f.collect_list(f.col(field.name)))).alias(field.name))
                else:
                    agg_funcs.append(f.first(f.col(field.name), ignorenulls=True).alias(field.name))

        # Perform aggregation
        aggregated_df = merged_df.groupBy("biosampleId").agg(*agg_funcs)

        return BiosampleIndex(
            _df=aggregated_df,
            _schema=BiosampleIndex.get_schema()
        )
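To make the merge rule concrete, here is a pure-Python sketch of the semantics described in the docstring: rows are grouped by `biosampleId`, single-valued columns keep the first non-null value, and list columns keep the de-duplicated union. This is an illustration with plain dictionaries, not the Spark implementation; `merge_records` and the sample rows are hypothetical.

```python
from __future__ import annotations


def merge_records(rows: list[dict]) -> dict[str, dict]:
    """Merge rows by biosampleId: first non-null scalar wins, lists are unioned in order."""
    merged: dict[str, dict] = {}
    for row in rows:
        target = merged.setdefault(row["biosampleId"], {"biosampleId": row["biosampleId"]})
        for name, value in row.items():
            if name == "biosampleId" or value is None:
                continue
            if isinstance(value, list):
                # Union of list values, preserving first-seen order.
                existing = target.setdefault(name, [])
                existing.extend(v for v in value if v not in existing)
            elif target.get(name) is None:
                # Keep the first non-null scalar value.
                target[name] = value
    return merged


rows = [
    {"biosampleId": "X_1", "biosampleName": None, "synonyms": ["a", "b"]},
    {"biosampleId": "X_1", "biosampleName": "first name", "synonyms": ["b", "c"]},
    {"biosampleId": "X_1", "biosampleName": "second name", "synonyms": None},
    {"biosampleId": "X_2", "biosampleName": "other", "synonyms": []},
]
result = merge_records(rows)
```

One difference worth keeping in mind: the sketch processes rows in list order, whereas Spark does not guarantee row order within a group, so which value counts as "first" there depends on how the rows arrive.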
36 changes: 36 additions & 0 deletions src/gentropy/dataset/study_index.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@
from pyspark.sql import Column, DataFrame
from pyspark.sql.types import StructType

from gentropy.dataset.biosample_index import BiosampleIndex
from gentropy.dataset.gene_index import GeneIndex


Expand All @@ -29,13 +30,15 @@ class StudyQualityCheck(Enum):
        UNRESOLVED_TARGET (str): Target/gene identifier could not match to reference - Labelling failing target.
        UNRESOLVED_DISEASE (str): Disease identifier could not match to reference or retired identifier - labelling failing disease
        UNKNOWN_STUDY_TYPE (str): Indicating the provided type of study is not supported.
        UNKNOWN_BIOSAMPLE (str): Flagging if a biosample identifier is not found in the reference.
        DUPLICATED_STUDY (str): Flagging if a study identifier is not unique.
        NO_GENE_PROVIDED (str): Flagging QTL studies if the measured
    """

    UNRESOLVED_TARGET = "Target/gene identifier could not match to reference."
    UNRESOLVED_DISEASE = "No valid disease identifier found."
    UNKNOWN_STUDY_TYPE = "This type of study is not supported."
    UNKNOWN_BIOSAMPLE = "Biosample identifier was not found in the reference."
    DUPLICATED_STUDY = "The identifier of this study is not unique."
    NO_GENE_PROVIDED = "QTL study doesn't have gene assigned."

Expand Down Expand Up @@ -408,3 +411,36 @@ def validate_target(self: StudyIndex, target_index: GeneIndex) -> StudyIndex:
        )

        return StudyIndex(_df=validated_df, _schema=StudyIndex.get_schema())

    def validate_biosample(self: StudyIndex, biosample_index: BiosampleIndex) -> StudyIndex:
        """Validating biosample identifiers in the study index against the provided biosample index.
Comment on lines +415 to +416
@DSuveges DSuveges Sep 19, 2024


This is not a critique, rather a comment: we are doing exactly the same steps for genes (disease validation is more complicated). I'm wondering if we could have one function used by both the biosample and gene validation, e.g. we would pass the following parameters to this function:

  • this is the list of fields you need to find in
  • this column, and if you don't find them, add
  • this flag...

Not exactly sure how to implement this, but it sounds abstract enough that it could be used in other places and maybe other datasets.


        Args:
            biosample_index (BiosampleIndex): Biosample index containing a reference of biosample identifiers e.g. cell types, tissues, cell lines, etc.

        Returns:
            StudyIndex: Study index with studies flagged where the biosample identifier could not be validated.
        """
        biosample_set = biosample_index.df.select("biosampleId", f.lit(True).alias("isIdFound"))

        validated_df = (
            self.df.join(biosample_set, self.df.biosampleFromSourceId == biosample_set.biosampleId, how="left")
            .withColumn(
                "isIdFound",
                f.when(
                    f.col("isIdFound").isNull(),
                    f.lit(False),
                ).otherwise(f.lit(True)),
            )
            .withColumn(
                "qualityControls",
                StudyIndex.update_quality_flag(
                    f.col("qualityControls"),
                    ~f.col("isIdFound"),
                    StudyQualityCheck.UNKNOWN_BIOSAMPLE,
                ),
            )
            .drop("isIdFound").drop("biosampleId")
        )

        return StudyIndex(_df=validated_df, _schema=StudyIndex.get_schema())
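The review comment above suggests a single helper shared by the gene and biosample validation. A minimal pure-Python sketch of that idea follows, assuming records are plain dictionaries with a `qualityControls` list. `flag_unknown_ids` and its parameters are hypothetical; they only mirror the left-join-and-flag logic of `validate_biosample` and are not an existing gentropy function.

```python
from __future__ import annotations

UNKNOWN_BIOSAMPLE = "Biosample identifier was not found in the reference."


def flag_unknown_ids(
    studies: list[dict],
    id_field: str,
    reference_ids: set[str],
    flag: str,
) -> list[dict]:
    """Return copies of the studies, appending `flag` where `id_field` is not in the reference."""
    flagged = []
    for study in studies:
        controls = list(study.get("qualityControls") or [])
        # Mirrors the left join: a missing or unmatched identifier counts as not found.
        if study.get(id_field) not in reference_ids and flag not in controls:
            controls.append(flag)
        flagged.append({**study, "qualityControls": controls})
    return flagged


studies = [
    {"studyId": "S1", "biosampleFromSourceId": "UBERON_0000001", "qualityControls": []},
    {"studyId": "S2", "biosampleFromSourceId": "NOT_IN_INDEX", "qualityControls": []},
    {"studyId": "S3", "biosampleFromSourceId": None, "qualityControls": None},
]
validated = flag_unknown_ids(studies, "biosampleFromSourceId", {"UBERON_0000001"}, UNKNOWN_BIOSAMPLE)
```

The same function could be called with a gene identifier field and a different flag. Note that, like the Spark code as written, the sketch also flags a study that has no biosample identifier at all, because a null never matches in the join; whether that is intended for studies without a biosample may be worth confirming.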
3 changes: 3 additions & 0 deletions src/gentropy/datasource/biosample_ontologies/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
"""Biosample index data source."""

from __future__ import annotations