feat: add biosample index #769

Tobi1kenobi · 2024-09-18T11:32:50Z

✨ Context

🛠 What does this PR implement

🙈 Missing

🚦 Before submitting

Do these changes cover one single feature (one change at a time)?
Did you read the contributor guideline?
Did you make sure to update the documentation with your changes?
Did you make sure there is no commented out code in this PR?
Did you follow conventional commits standards in PR title and commit messages?
Did you make sure the branch is up-to-date with the dev branch?
Did you write any new necessary tests?
Did you make sure the changes pass local tests (make test)?
Did you make sure the changes pass pre-commit rules (e.g poetry run pre-commit run --all-files)?

… in biosample index

Tobi1kenobi · 2024-09-18T15:50:25Z

Few notes:

When joining onto the eQTL Catalogue, three biosampleIds have no match in the biosample index: BTO_0000930 (neuroblast), EFO_0005292 (lymphoblastoid cell line), EFO_0004905 (induced pluripotent stem cell). When considering other use cases for this index, some thought should go into how best to capture these biosamples, i.e. which ontology is best for cell lines.
Currently there is no column indicating whether a term is deprecated. This is not very relevant for the current purpose of left-joining, and it wasn't entirely clear how to implement it (drop deprecated terms, flag them, etc.).
dbXrefs is currently a flat list of strings rather than split into ID and source as done for the variant index. This was partly for simplicity, partly because I was unsure how best to split (e.g. CL_1001593 -> [CL_1001593, CL], [CL_1001593, Cell Ontology], [1001593, CL], [CL_1001593, Uberon (ontology actually sourced from)]), and partly because I didn't see an immediate use case, unlike for the variant index.
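For reference, the unmatched-ID check behind the first note can be sketched in plain Python. This is only an illustration of the left-join sanity check, not the code in this PR: the function name `unmatched_biosample_ids` and the toy inputs are assumptions, and the real check would run on the Spark datasets.

```python
from typing import Iterable, List, Optional


def unmatched_biosample_ids(
    study_biosample_ids: Iterable[Optional[str]],
    index_biosample_ids: Iterable[str],
) -> List[str]:
    """Return the distinct study biosample IDs that have no match in the index.

    Hypothetical helper: mirrors what a left anti-join on biosampleId reports.
    """
    index_ids = set(index_biosample_ids)
    unmatched = {
        biosample_id
        for biosample_id in study_biosample_ids
        if biosample_id is not None and biosample_id not in index_ids
    }
    # Sort so the report is stable between runs.
    return sorted(unmatched)


# Toy inputs for illustration only, not the real datasets.
study_ids = ["UBERON_0002107", "BTO_0000930", "EFO_0005292", "BTO_0000930", None]
index_ids = ["UBERON_0002107", "CL_0000236"]
print(unmatched_biosample_ids(study_ids, index_ids))
```

Null IDs are skipped on purpose here, since they can never match; whether they should be reported separately is a design choice.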

DSuveges · 2024-09-18T22:03:35Z

src/gentropy/assets/schemas/biosample_index.json

+      "name": "dbXrefs",
+      "type": {
+        "type": "array",
+        "elementType": "string",
+        "containsNull": true
+      },
+      "nullable": true,
+      "metadata": {}
+    },


This is not really good. We already have this column in other datasets, e.g. VariantIndex and others in the platform. The expectation for this field is that it is a list of structs, where each element informs not only about the identifier but also about its source:

    {
      "metadata": {},
      "name": "dbXrefs",
      "nullable": true,
      "type": {
        "containsNull": true,
        "elementType": {
          "fields": [
            { "metadata": {}, "name": "id", "nullable": true, "type": "string" },
            { "metadata": {}, "name": "source", "nullable": true, "type": "string" }
          ],
          "type": "struct"
        },
        "type": "array"
      }
    }

I agree, sorta... See my comment on the pull request. It wasn't immediately apparent how to structure it
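One possible way to move from the flat list to the requested id/source structs is sketched below in plain Python. It takes just one of the splitting options raised in the PR notes (keep the full identifier as `id`, use the prefix as `source`), which is an assumption rather than an agreed convention. The helper names and the regular expression are hypothetical.

```python
import re
from typing import Dict, List, Optional

# Assumption: identifiers look like PREFIX_LOCALID or PREFIX:LOCALID.
_PREFIXED_ID = re.compile(r"^([A-Za-z][A-Za-z0-9]*)[_:](.+)$")


def to_xref_struct(xref: Optional[str]) -> Optional[Dict[str, Optional[str]]]:
    """Turn one cross-reference string into an {id, source} record."""
    if xref is None:
        return None
    match = _PREFIXED_ID.match(xref)
    # If no prefix can be recognised, keep the id and leave the source empty.
    source = match.group(1) if match else None
    return {"id": xref, "source": source}


def to_xref_structs(
    xrefs: Optional[List[Optional[str]]],
) -> Optional[List[Optional[Dict[str, Optional[str]]]]]:
    """Apply to_xref_struct to a nullable list that may contain nulls."""
    if xrefs is None:
        return None
    return [to_xref_struct(xref) for xref in xrefs]


print(to_xref_structs(["CL_1001593", "UBERON:0002107", None]))
```

The nullable list and nullable elements mirror `nullable: true` and `containsNull: true` in the schema. Mapping a prefix to the ontology it was actually sourced from (the Uberon case mentioned in the notes) would need an explicit lookup table, which this sketch does not attempt.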

DSuveges · 2024-09-18T22:03:54Z

docs/python_api/datasources/_datasources.md

@@ -26,7 +26,7 @@ This section contains information about the data source harmonisation tools avai
 2. GWAS catalog's [harmonisation pipeline](https://www.ebi.ac.uk/gwas/docs/methods/summary-statistics#_harmonised_summary_statistics_data)
 3. Ensembl's [Variant Effect Predictor](https://www.ensembl.org/info/docs/tools/vep/index.html)

-## Linkage desiquilibrium


Great catch!

…ts/gentropy into alegbe-biosample_index

DSuveges · 2024-09-19T10:48:53Z

src/gentropy/config.py

+    target_path: str = MISSING
+    biosample_index_path: str = MISSING
+    _target_: str = "gentropy.biosample_index.BiosampleIndexStep"


The configuration should have all the inputs that the step needs.

cell_ontology_input_path

uberon_input_path

biosample_index_output_path

All can/should be set to MISSING in this case, because there are no hardcoded parameters (e.g. an r2 threshold for the LD index). Similarly, target_path needs to be dropped, as it is not one of the BiosampleIndexStep parameters.
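A minimal sketch of what the suggested configuration could look like, using the three parameter names listed above. The class name `BiosampleIndexConfig` and the helper `missing_fields` are hypothetical, and `MISSING` is defined locally as a stand-in for `omegaconf.MISSING` (which is the string `"???"`) so the example runs without dependencies. The real class in `config.py` would import it from `omegaconf` and follow the existing step-config conventions.

```python
from dataclasses import dataclass, fields
from typing import List

# Stand-in for omegaconf.MISSING, which is the sentinel string "???".
MISSING = "???"


@dataclass
class BiosampleIndexConfig:
    """Hypothetical step configuration: every input and output is mandatory."""

    cell_ontology_input_path: str = MISSING
    uberon_input_path: str = MISSING
    biosample_index_output_path: str = MISSING
    _target_: str = "gentropy.biosample_index.BiosampleIndexStep"


def missing_fields(config: BiosampleIndexConfig) -> List[str]:
    """List the fields the user still has to provide (illustrative helper)."""
    return [f.name for f in fields(config) if getattr(config, f.name) == MISSING]


config = BiosampleIndexConfig(uberon_input_path="path/to/uberon.json")
print(missing_fields(config))
```

Leaving every path as MISSING means the step fails early when an input is not supplied, instead of silently falling back to a default.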

DSuveges · 2024-09-19T10:56:07Z