feat: add biosample index #769

Merged
merged 33 commits on Sep 24, 2024

Changes from 23 commits

Commits (33)
6fabc98
Initial commit of biosample index
Tobi1kenobi Sep 9, 2024
e8a3775
Make minimal class
Tobi1kenobi Sep 9, 2024
c4d6d5f
Tidy up first draft of adding biosample index
Tobi1kenobi Sep 9, 2024
186e773
Add beginning of logic for checking if biosample from a studyindex is…
Tobi1kenobi Sep 10, 2024
6f0a2e2
Make early file for merging multiple biosample indices into one
Tobi1kenobi Sep 10, 2024
01917a7
Merge branch 'dev' into alegbe-biosample_index
Tobi1kenobi Sep 10, 2024
55e2baf
Finish adding basic iteration of biosample index, needs debugging
Tobi1kenobi Sep 10, 2024
692732b
Tweak slightly
Tobi1kenobi Sep 13, 2024
30dc23f
Modified the parser to accept JSON files
Tobi1kenobi Sep 13, 2024
28e1f92
Update biosample index
Tobi1kenobi Sep 16, 2024
33ebf58
Tests and docs
Tobi1kenobi Sep 16, 2024
26a4295
Updating tests
Tobi1kenobi Sep 16, 2024
1c507e6
Revert GWAS catalog file
Tobi1kenobi Sep 16, 2024
567d8e1
fix(biosample index): update to match pre-commit standards
Tobi1kenobi Sep 17, 2024
12293d3
fix(biosample index): merging indices fix
Tobi1kenobi Sep 17, 2024
850f910
fix(biosample index): update study index qc logic
Tobi1kenobi Sep 17, 2024
c42bdd6
fix(biosample index): fix missing mock_biosample_index
Tobi1kenobi Sep 18, 2024
07daedc
chore(biosample index): change datasource name from ontologies
Tobi1kenobi Sep 18, 2024
fb98e15
chore(biosample index): merge local
Tobi1kenobi Sep 18, 2024
b150122
fix(biosample index): add dataset doc
Tobi1kenobi Sep 18, 2024
4fb8d05
Merge branch 'dev' into alegbe-biosample_index
DSuveges Sep 18, 2024
978f636
fix(biosample index): change dbXrefs to xrefs
Tobi1kenobi Sep 19, 2024
2364f8d
Merge branch 'alegbe-biosample_index' of https://github.com/opentarge…
Tobi1kenobi Sep 19, 2024
6f99147
Merge branch 'dev' into alegbe-biosample_index
DSuveges Sep 20, 2024
ec4edf3
chore (biosample index): better commenting
Tobi1kenobi Sep 20, 2024
cf00504
fix(biosample index): various minor tweaks to biosample index
Tobi1kenobi Sep 21, 2024
ad947cb
Merge branch 'alegbe-biosample_index' of https://github.com/opentarge…
Tobi1kenobi Sep 21, 2024
729f492
fix(biosample index): minor bug
Tobi1kenobi Sep 21, 2024
1e660c2
fix(biosample index): fix merge shift to method
Tobi1kenobi Sep 23, 2024
ca4fce3
feat(biosample index): make biosampleName not nullable
Tobi1kenobi Sep 23, 2024
73b25da
Merge branch 'dev' into alegbe-biosample_index
DSuveges Sep 24, 2024
c9eada2
Merge branch 'dev' into alegbe-biosample_index
DSuveges Sep 24, 2024
f93b995
Merge branch 'dev' into alegbe-biosample_index
DSuveges Sep 24, 2024
9 changes: 9 additions & 0 deletions docs/python_api/datasets/biosample_index.md
@@ -0,0 +1,9 @@
---
title: Biosample index
---

::: gentropy.dataset.biosample_index.BiosampleIndex

## Schema

--8<-- "assets/schemas/biosample_index.md"
7 changes: 6 additions & 1 deletion docs/python_api/datasources/_datasources.md
@@ -26,7 +26,7 @@ This section contains information about the data source harmonisation tools avai
2. GWAS catalog's [harmonisation pipeline](https://www.ebi.ac.uk/gwas/docs/methods/summary-statistics#_harmonised_summary_statistics_data)
3. Ensembl's [Variant Effect Predictor](https://www.ensembl.org/info/docs/tools/vep/index.html)

-## Linkage desiquilibrium
+## Linkage disequilibrium

1. [GnomAD](gnomad/_gnomad.md) v2.1.1 LD matrixes (7 ancestries)

@@ -37,3 +37,8 @@ This section contains information about the data source harmonisation tools avai
## Gene annotation

1. [Open Targets Platform Target Dataset](open_targets/target.md) (derived from Ensembl)

## Biological samples

1. [Uberon](biosample_ontologies/_uberon.md)
2. [Cell Ontology](biosample_ontologies/_cell_ontology.md)
@@ -0,0 +1,5 @@
---
title: Cell Ontology
---

The [Cell Ontology](http://www.obofoundry.org/ontology/cl.html) is a structured controlled vocabulary for cell types. It is used to annotate cell types in single-cell RNA-seq data and other omics data.
5 changes: 5 additions & 0 deletions docs/python_api/datasources/biosample_ontologies/_uberon.md
@@ -0,0 +1,5 @@
---
title: Uberon
---

The [Uberon](http://uberon.github.io/) ontology is a multi-species anatomy ontology that integrates cross-species ontologies into a single ontology.
5 changes: 5 additions & 0 deletions docs/python_api/steps/biosample_index_step.md
@@ -0,0 +1,5 @@
---
title: biosample_index
---

::: gentropy.biosample_index.BiosampleIndexStep
83 changes: 83 additions & 0 deletions src/gentropy/assets/schemas/biosample_index.json
@@ -0,0 +1,83 @@
{
  "type": "struct",
  "fields": [
    {
      "name": "biosampleId",
      "type": "string",
      "nullable": false,
      "metadata": {}
    },
    {
      "name": "biosampleName",
      "type": "string",
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "description",
      "type": "string",
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "xrefs",
      "type": {
        "type": "array",
        "elementType": "string",
        "containsNull": true
      },
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "synonyms",
      "type": {
        "type": "array",
        "elementType": "string",
        "containsNull": true
      },
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "parents",
      "type": {
        "type": "array",
        "elementType": "string",
        "containsNull": true
      },
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "ancestors",
      "type": {
        "type": "array",
        "elementType": "string",
        "containsNull": true
      },
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "descendants",
      "type": {
        "type": "array",
        "elementType": "string",
        "containsNull": true
      },
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "children",
      "type": {
        "type": "array",
        "elementType": "string",
        "containsNull": true
      },
      "nullable": true,
      "metadata": {}
    }
  ]
}
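
To make the nullability rules in this schema concrete, the sketch below checks a plain Python dict against a trimmed copy of the field list. It is only an illustration: `check_record` and both sample records are hypothetical and not part of gentropy, where the JSON is instead parsed into a Spark `StructType` via `parse_spark_schema`.

```python
from __future__ import annotations

import json

# Trimmed copy of the schema: one required string, one optional string,
# one optional array of strings. The remaining array fields follow the same shape.
SCHEMA_JSON = """
{
  "type": "struct",
  "fields": [
    {"name": "biosampleId", "type": "string", "nullable": false, "metadata": {}},
    {"name": "biosampleName", "type": "string", "nullable": true, "metadata": {}},
    {"name": "synonyms",
     "type": {"type": "array", "elementType": "string", "containsNull": true},
     "nullable": true, "metadata": {}}
  ]
}
"""


def check_record(record: dict, schema: dict) -> list[str]:
    """Return human-readable schema violations for one record (hypothetical helper)."""
    problems = []
    for field in schema["fields"]:
        name, ftype = field["name"], field["type"]
        value = record.get(name)
        if value is None:
            # A missing value is only a problem for non-nullable fields.
            if not field["nullable"]:
                problems.append(f"{name}: missing but not nullable")
            continue
        if ftype == "string" and not isinstance(value, str):
            problems.append(f"{name}: expected string")
        elif isinstance(ftype, dict) and ftype["type"] == "array":
            if not isinstance(value, list):
                problems.append(f"{name}: expected array")
            elif not all(v is None or isinstance(v, str) for v in value):
                problems.append(f"{name}: expected array of strings")
    return problems


schema = json.loads(SCHEMA_JSON)
# Illustrative records only; the identifier format is an assumption.
good = {"biosampleId": "CL_0000000", "biosampleName": "cell", "synonyms": ["cell type"]}
bad = {"biosampleName": "no id", "synonyms": "not-a-list"}

print(check_record(good, schema))
print(check_record(bad, schema))
```

Only `biosampleId` is mandatory at this point in the diff, so a record consisting of just an identifier would pass the check.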
37 changes: 37 additions & 0 deletions src/gentropy/biosample_index.py
@@ -0,0 +1,37 @@
"""Step to generate biosample index dataset."""
from __future__ import annotations

from gentropy.common.session import Session
from gentropy.datasource.biosample_ontologies.utils import (
    extract_ontology_from_json,
    merge_biosample_indices,
)


class BiosampleIndexStep:
    """Biosample index step.

    This step generates a Biosample index dataset from the various ontology sources. Currently Cell Ontology and Uberon are supported.
    """

    def __init__(
        self,
        session: Session,
        cell_ontology_input_path: str,
        uberon_input_path: str,
        biosample_index_output_path: str,
    ) -> None:
        """Run Biosample index generation step.

        Args:
            session (Session): Session object.
            cell_ontology_input_path (str): Input cell ontology dataset path.
            uberon_input_path (str): Input uberon dataset path.
            biosample_index_output_path (str): Output biosample index dataset path.
        """
        cell_ontology_index = extract_ontology_from_json(cell_ontology_input_path, session.spark)
        uberon_index = extract_ontology_from_json(uberon_input_path, session.spark)

        biosample_index = merge_biosample_indices([cell_ontology_index, uberon_index])

        biosample_index.df.write.mode(session.write_mode).parquet(biosample_index_output_path)
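
The body of `merge_biosample_indices` is not visible in this part of the diff. One plausible reading is that records sharing a `biosampleId` are collapsed into one, with array fields unioned. The plain-Python sketch below illustrates that idea only. `merge_records`, the choice to keep the first non-null scalar, and the sample identifiers are all assumptions, not the gentropy implementation, which operates on Spark DataFrames.

```python
from __future__ import annotations

ARRAY_FIELDS = ("xrefs", "synonyms", "parents", "ancestors", "descendants", "children")
SCALAR_FIELDS = ("biosampleName", "description")


def merge_records(indices: list[list[dict]]) -> list[dict]:
    """Collapse records sharing a biosampleId (hypothetical plain-Python analogue)."""
    merged: dict[str, dict] = {}
    for index in indices:
        for rec in index:
            out = merged.setdefault(rec["biosampleId"], {"biosampleId": rec["biosampleId"]})
            # Scalars: keep the first non-null value seen (an assumed policy).
            for key in SCALAR_FIELDS:
                if out.get(key) is None:
                    out[key] = rec.get(key)
            # Arrays: union the values while preserving first-seen order.
            for key in ARRAY_FIELDS:
                seen = out.setdefault(key, [])
                for item in rec.get(key) or []:
                    if item not in seen:
                        seen.append(item)
    return list(merged.values())


# Made-up records; the identifiers are placeholders for illustration.
cell_ontology = [{"biosampleId": "CL_0000000", "biosampleName": "cell", "synonyms": ["a"]}]
uberon = [
    {"biosampleId": "CL_0000000", "biosampleName": None, "synonyms": ["a", "b"]},
    {"biosampleId": "UBERON_0000001", "biosampleName": "example tissue", "synonyms": []},
]
result = merge_records([cell_ontology, uberon])
print([r["biosampleId"] for r in result])
```

Passing the sources as a list, as the step does, means a third ontology could be added without changing the merge signature.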
10 changes: 10 additions & 0 deletions src/gentropy/config.py
@@ -51,6 +51,15 @@ class GeneIndexConfig(StepConfig):
    _target_: str = "gentropy.gene_index.GeneIndexStep"


@dataclass
class BiosampleIndexConfig(StepConfig):
    """Biosample index step configuration."""

    # Field names mirror the keyword arguments of BiosampleIndexStep.__init__.
    cell_ontology_input_path: str = MISSING
    uberon_input_path: str = MISSING
    biosample_index_output_path: str = MISSING
    _target_: str = "gentropy.biosample_index.BiosampleIndexStep"


@dataclass
class GWASCatalogStudyCurationConfig(StepConfig):
    """GWAS Catalog study curation step configuration."""
@@ -545,6 +554,7 @@ def register_config() -> None:
    cs.store(group="step", name="colocalisation", node=ColocalisationConfig)
    cs.store(group="step", name="eqtl_catalogue", node=EqtlCatalogueConfig)
    cs.store(group="step", name="gene_index", node=GeneIndexConfig)
    cs.store(group="step", name="biosample_index", node=BiosampleIndexConfig)
    cs.store(
        group="step",
        name="gwas_catalog_study_curation",
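
Assuming the step is built by passing the config's fields as keyword arguments to the `_target_` class (the usual Hydra-style pattern, which this file's `_target_` and `MISSING` conventions suggest but do not prove), the field names of a step config need to line up with the step's `__init__` parameters. The sketch below shows one way such a mismatch could be detected. `DemoStep`, `DemoConfig` and `mismatched_fields` are hypothetical stand-ins and do not exist in gentropy.

```python
from __future__ import annotations

import inspect
from dataclasses import dataclass, fields


class DemoStep:
    """Stand-in step; parameter names follow BiosampleIndexStep in this PR."""

    def __init__(
        self,
        session: object,
        cell_ontology_input_path: str,
        uberon_input_path: str,
        biosample_index_output_path: str,
    ) -> None:
        self.output_path = biosample_index_output_path


@dataclass
class DemoConfig:
    """Stand-in config with one deliberately drifted field name."""

    cell_ontology_input_path: str = "???"
    uberon_input_path: str = "???"
    output_path: str = "???"  # does not match any __init__ parameter
    _target_: str = "demo.DemoStep"


def mismatched_fields(config_cls: type, step_cls: type) -> set[str]:
    """Return config fields with no matching __init__ parameter (hypothetical helper)."""
    params = set(inspect.signature(step_cls.__init__).parameters) - {"self"}
    # Underscore-prefixed fields such as _target_ are metadata, not arguments.
    names = {f.name for f in fields(config_cls) if not f.name.startswith("_")}
    return names - params


unmatched = mismatched_fields(DemoConfig, DemoStep)
print(sorted(unmatched))
```

A check of this kind could sit in a unit test so that renaming a step argument without updating its config fails early rather than at run time.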
29 changes: 29 additions & 0 deletions src/gentropy/dataset/biosample_index.py
@@ -0,0 +1,29 @@
"""Biosample index dataset."""

from __future__ import annotations

from dataclasses import dataclass
from typing import TYPE_CHECKING

from gentropy.common.schemas import parse_spark_schema
from gentropy.dataset.dataset import Dataset

if TYPE_CHECKING:
    from pyspark.sql.types import StructType


@dataclass
class BiosampleIndex(Dataset):
    """Biosample index dataset.

    A Biosample index dataset captures the metadata of the biosamples (e.g. tissues, cell types, cell lines), such as alternate names and relationships with other biosamples.
    """

    @classmethod
    def get_schema(cls: type[BiosampleIndex]) -> StructType:
        """Provide the schema for the BiosampleIndex dataset.

        Returns:
            StructType: The schema of the BiosampleIndex dataset.
        """
        return parse_spark_schema("biosample_index.json")
36 changes: 36 additions & 0 deletions src/gentropy/dataset/study_index.py
@@ -19,6 +19,7 @@
from pyspark.sql import Column, DataFrame
from pyspark.sql.types import StructType

from gentropy.dataset.biosample_index import BiosampleIndex
from gentropy.dataset.gene_index import GeneIndex


@@ -29,13 +30,15 @@ class StudyQualityCheck(Enum):
        UNRESOLVED_TARGET (str): Target/gene identifier could not match to reference - Labelling failing target.
        UNRESOLVED_DISEASE (str): Disease identifier could not match to reference or retired identifier - labelling failing disease
        UNKNOWN_STUDY_TYPE (str): Indicating the provided type of study is not supported.
        UNKNOWN_BIOSAMPLE (str): Flagging if a biosample identifier is not found in the reference.
        DUPLICATED_STUDY (str): Flagging if a study identifier is not unique.
        NO_GENE_PROVIDED (str): Flagging QTL studies if the measured
    """

    UNRESOLVED_TARGET = "Target/gene identifier could not match to reference."
    UNRESOLVED_DISEASE = "No valid disease identifier found."
    UNKNOWN_STUDY_TYPE = "This type of study is not supported."
    UNKNOWN_BIOSAMPLE = "Biosample identifier was not found in the reference."
    DUPLICATED_STUDY = "The identifier of this study is not unique."
    NO_GENE_PROVIDED = "QTL study doesn't have gene assigned."

@@ -408,3 +411,36 @@ def validate_target(self: StudyIndex, target_index: GeneIndex) -> StudyIndex:
        )

        return StudyIndex(_df=validated_df, _schema=StudyIndex.get_schema())

    def validate_biosample(self: StudyIndex, biosample_index: BiosampleIndex) -> StudyIndex:
        """Validating biosample identifiers in the study index against the provided biosample index.
Comment on lines +413 to +414

DSuveges (Contributor), Sep 19, 2024:

This is not a critique, rather a comment: we are doing exactly the same steps for genes (disease validation is more complicated). I'm wondering if we could have one function used by both the biosample and gene validation. E.g. we would pass the following parameters to this function:

  • this is the list of fields you need to find in
  • this column, and if you don't find them, add
  • this flag...

I'm not exactly sure how to implement this, but it sounds abstract enough that it could be used in other places and other datasets.

Reply (Collaborator):

After doing a few of these, the problem I see in abstracting this further is that the rules that decide when to add a flag are very diverse. Sometimes it is just a filter, but sometimes it has dependencies, aggregations, etc.

I'm referring to everything that leads to the second argument of update_quality_flag, if this is what you mean here.


        Args:
            biosample_index (BiosampleIndex): Biosample index containing a reference of biosample identifiers (e.g. cell types, tissues, cell lines).

        Returns:
            StudyIndex: Study index with studies flagged where the biosample identifier could not be found in the biosample index.
        """
        # Reference identifiers, each tagged with a marker column for the join.
        biosample_set = biosample_index.df.select("biosampleId", f.lit(True).alias("isIdFound"))

        validated_df = (
            # Left join keeps every study; unmatched studies get a null marker.
            self.df.join(biosample_set, self.df.biosampleFromSourceId == biosample_set.biosampleId, how="left")
            .withColumn(
                "isIdFound",
                f.when(
                    f.col("isIdFound").isNull(),
                    f.lit(False),
                ).otherwise(f.lit(True)),
            )
            # Flag studies whose biosample identifier had no match.
            .withColumn(
                "qualityControls",
                StudyIndex.update_quality_flag(
                    f.col("qualityControls"),
                    ~f.col("isIdFound"),
                    StudyQualityCheck.UNKNOWN_BIOSAMPLE,
                ),
            )
            .drop("isIdFound").drop("biosampleId")
        )

        return StudyIndex(_df=validated_df, _schema=StudyIndex.get_schema())
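
The shared helper floated in the review thread can be sketched in plain Python for the simple lookup case. This is an illustration under stated assumptions, not a proposed implementation: `flag_missing` and the record layout are hypothetical, and suppressing duplicate flags is an assumption about how `update_quality_flag` behaves. As the reply in the thread points out, rules involving dependencies or aggregations would not fit this shape.

```python
from __future__ import annotations

# Flag text copied from StudyQualityCheck.UNKNOWN_BIOSAMPLE in this diff.
UNKNOWN_BIOSAMPLE = "Biosample identifier was not found in the reference."


def flag_missing(
    studies: list[dict], reference_ids: set[str], column: str, flag: str
) -> list[dict]:
    """Add `flag` to qualityControls when study[column] is absent from the reference.

    Hypothetical plain-Python analogue of the left join plus quality-flag update.
    """
    flagged = []
    for study in studies:
        controls = list(study.get("qualityControls") or [])
        # A null identifier never matches, mirroring the left join in the method.
        if study.get(column) not in reference_ids and flag not in controls:
            controls.append(flag)
        # Build a new record so the input list is left untouched.
        flagged.append({**study, "qualityControls": controls})
    return flagged


# Made-up studies for illustration.
studies = [
    {"studyId": "S1", "biosampleFromSourceId": "CL_0000000", "qualityControls": []},
    {"studyId": "S2", "biosampleFromSourceId": "UNKNOWN_1", "qualityControls": None},
    {"studyId": "S3", "biosampleFromSourceId": None, "qualityControls": []},
]
result = flag_missing(studies, {"CL_0000000"}, "biosampleFromSourceId", UNKNOWN_BIOSAMPLE)
for row in result:
    print(row["studyId"], row["qualityControls"])
```

Note that study `S3`, which has no biosample identifier at all, is flagged too. The Spark version behaves the same way, because a null join key produces no match.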
3 changes: 3 additions & 0 deletions src/gentropy/datasource/biosample_ontologies/__init__.py
@@ -0,0 +1,3 @@
"""Biosample index data source."""

from __future__ import annotations