docs: hiv #42
base: main
Changes from 7 commits
```diff
@@ -1 +1,2 @@
-processed
+/processed
+/Pickles
```
```diff
@@ -0,0 +1,15 @@
+# HIV dataset
+
+## Raw files
+
+See `raw/README.md`.
+
+## File organization
+
+See `Snakefile` for the way that all of the IO files are connected.
+
+1. `fetch.py` - This grabs the score files from https://doi.org/10.1371/journal.ppat.1011492 - see `fetch.py` for more info.
+1. `prepare.py` - This cleans up the prize files in `raw`; specifically to remove duplicates.
+1. `name_mapping.py` - Converts from UniProt KB-ACID to UniProt KB to meet in the middle with `kegg_ortholog.py`. We chose UniProt KB for its generality.
+1. `spras_formatting.py` - Formats the input files into a SPRAS-ready format.
+1. `kegg_orthology.py` - This is used to generate the KEGG ortholog file for gold standards, but this has yet to be finalized.
```

Review thread on the `prepare.py` line:

**Collaborator:** The original readme was more detailed about these processing steps. For instance, one of the steps removed isoform identifiers, which I don't see mentioned here. My goal is to be able to visit the readme for a benchmarking dataset and be able to write the methods section of the manuscript from that information without re-reading all the code. We'll get bogged down during writing if we have to read all the source to remember how we processed each dataset.

**Contributor (Author):** This motivation makes sense - I was very confused about what the audience for a dataset-wide README is. My main worry is that, like the other READMEs present in this repository, we would end up with outdated documentation that doesn't actually match what happens during processing, so I want to ask a follow-up question (to determine how much we should worry about maintaining the README): would future users of SPRAS-benchmarking also write down the methodology described in the README, or is it specifically for the first benchmarking paper?

**Contributor (Author):** (My question was answered in another comment.)
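As a hypothetical illustration of the kind of cleanup `prepare.py` performs (deduplication, plus the isoform-identifier stripping mentioned in the review thread), here is a small sketch. The `(accession, prize)` row shape, function names, and example accessions are assumptions for illustration, not taken from the actual script:

```python
# Hypothetical sketch of the prize cleanup step: collapse UniProt isoform
# accessions (e.g. "P04637-2") to their canonical form, then drop
# duplicate rows, keeping the first prize seen for each node.

def canonical_accession(acc: str) -> str:
    """Strip an isoform suffix such as "-2" from a UniProt accession."""
    return acc.split("-")[0]

def clean_prizes(rows):
    """Deduplicate (node, prize) rows after canonicalizing accessions."""
    seen = set()
    cleaned = []
    for node, prize in rows:
        node = canonical_accession(node)
        if node not in seen:
            seen.add(node)
            cleaned.append((node, prize))
    return cleaned

print(clean_prizes([("P04637-2", 1.5), ("P04637", 1.5), ("Q9Y6K9", 0.3)]))
# [('P04637', 1.5), ('Q9Y6K9', 0.3)]
```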
Three files were deleted in this diff.
```diff
@@ -1,40 +1,48 @@
 rule all:
     input:
-        "processed/processed_prize_05.txt",
-        "processed/processed_prize_060.txt",
+        "processed/processed_prizes_05.txt",
+        "processed/processed_prizes_060.txt",
         "processed/phosphosite-irefindex13.0-uniprot.txt"
 
+rule fetch:
+    output:
+        "raw/prizes_05.tsv",
+        "raw/prizes_060.tsv",
+        "raw/ko03250.xml"
+    shell:
+        "uv run scripts/fetch.py"
+
 rule data_prep:
     input:
-        "raw/prize_05.csv",
-        "raw/prize_060.csv"
+        "raw/prizes_05.tsv",
+        "raw/prizes_060.tsv"
     output:
         "Pickles/NodeIDs.pkl"
     shell:
-        "uv run Scripts/Data_Prep.py"
+        "uv run scripts/prepare.py"
 
 rule name_mapping:
     input:
         "Pickles/NodeIDs.pkl"
     output:
         "Pickles/UniprotIDs.pkl"
     shell:
-        "uv run Scripts/Name_Mapping.py"
+        "uv run scripts/name_mapping.py"
 
 rule spras_formatting:
     input:
         "Pickles/NodeIDs.pkl",
         "Pickles/UniprotIDs.pkl"
     output:
-        "processed/processed_prize_05.txt",
-        "processed/processed_prize_060.txt"
+        "processed/processed_prizes_05.txt",
+        "processed/processed_prizes_060.txt"
     shell:
-        "uv run Scripts/SPRAS_Formatting.py"
+        "uv run scripts/spras_formatting.py"
 
 rule copy_network:
     input:
         "raw/phosphosite-irefindex13.0-uniprot.txt"
     output:
         "processed/phosphosite-irefindex13.0-uniprot.txt"
     shell:
-        "cp raw/phosphosite-irefindex13.0-uniprot.txt processed/phosphosite-irefindex13.0-uniprot.txt"
\ No newline at end of file
+        "cp raw/phosphosite-irefindex13.0-uniprot.txt processed/phosphosite-irefindex13.0-uniprot.txt"
```
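The Snakefile wires these rules together purely through filenames: each rule's inputs must be produced by another rule's outputs, which induces an execution order. A small sketch (not part of the PR) of that dependency chaining, with the rule names and files mirroring the Snakefile above:

```python
# Map each rule to its (input files, output files), as in the Snakefile.
rules = {
    "fetch": ([], ["raw/prizes_05.tsv", "raw/prizes_060.tsv", "raw/ko03250.xml"]),
    "data_prep": (["raw/prizes_05.tsv", "raw/prizes_060.tsv"], ["Pickles/NodeIDs.pkl"]),
    "name_mapping": (["Pickles/NodeIDs.pkl"], ["Pickles/UniprotIDs.pkl"]),
    "spras_formatting": (["Pickles/NodeIDs.pkl", "Pickles/UniprotIDs.pkl"],
                         ["processed/processed_prizes_05.txt",
                          "processed/processed_prizes_060.txt"]),
}

def execution_order(rules):
    """Topologically order rules so every input file exists before it is used."""
    produced, order, pending = set(), [], dict(rules)
    while pending:
        for name, (inputs, outputs) in list(pending.items()):
            if all(f in produced for f in inputs):
                produced.update(outputs)
                order.append(name)
                del pending[name]
                break
        else:
            raise ValueError("cycle or missing producer")
    return order

print(execution_order(rules))
# ['fetch', 'data_prep', 'name_mapping', 'spras_formatting']
```

This is the same file-driven scheduling Snakemake performs when `rule all` requests the processed prize files.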
```diff
@@ -0,0 +1,3 @@
+prizes_05.tsv
+prizes_060.tsv
+ko03250.xml
```
```diff
@@ -0,0 +1,10 @@
+# raw
+
+Some `raw` files are fetched from `../scripts/fetch.py`.
+
+The `phosphosite-irefindex13.0-uniprot.txt` is
+a background interactome provided by SPRAS: https://github.com/Reed-CompBio/spras/blob/be8bc7f8d71880d7ce9c9ceeeddfefa6eb60c522/input/phosphosite-irefindex13.0-uniprot.txt.
+
+The `ko03250.xml` is from `https://www.kegg.jp/entry/ko03250`. Specifically, if you click on the pathway image in the entry,
+you'll get to https://www.kegg.jp/pathway/ko03250, where you download the KGML file (which is formatted as a `.xml` file)
+under `Download` -> `KGML`. (The final file is at https://www.kegg.jp/kegg-bin/download?entry=ko03250&format=kgml).
```

Review thread on the `# raw` line:

**Collaborator:** I'm missing where the rest of the input files come from. Having it in

**Contributor (Author):** I'm a little confused about this comment (though there was some clarification missing on a paper link, which I've just committed) - do you want docs duplication from

**Collaborator:** We can call it duplication, but yes. I want the readme(s) to tell the story of the dataset. There will be multiple scripts needed to execute the benchmark, and I see two different goals:

We could go through the exercise of writing one of the paragraphs for the benchmarking paper manuscript to see what information we need to expose.

**Contributor (Author):** This is the current level of detail I was going for in the benchmarking report: structure this section so each dataset explains what the data is and the biological context, where all the data came from, the sources/targets/prizes (include the biological roles of the inputs?), the IDs (explain why the translation is needed and why the chosen ID was chosen), the gold standard, the interactome, and the preprocessing for everything.
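KGML, the format of the downloaded `ko03250.xml`, is plain XML whose `<entry>` elements carry KEGG identifiers. A minimal sketch of pulling ortholog identifiers out of such a file; the inline snippet below is an illustrative stand-in for the real pathway download, and the specific KO identifiers in it are example values:

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for a KGML pathway file such as ko03250.xml.
kgml = """<pathway name="path:ko03250" org="ko" number="03250">
  <entry id="1" name="ko:K02158" type="ortholog"/>
  <entry id="2" name="ko:K04397" type="ortholog"/>
  <entry id="3" name="path:ko04210" type="map"/>
</pathway>"""

root = ET.fromstring(kgml)
# Keep only entries of type "ortholog" (KO identifiers), skipping linked maps.
orthologs = [e.get("name") for e in root.findall("entry") if e.get("type") == "ortholog"]
print(orthologs)
# ['ko:K02158', 'ko:K04397']
```

For the real file, `ET.parse("raw/ko03250.xml").getroot()` would replace `ET.fromstring`.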
I would find it helpful to have more of the context from https://github.com/Reed-CompBio/spras-benchmarking/blob/0293ae4dc0be59502fac06b42cfd9796a4b4413e/hiv-benchmarking/README.md. What is the overall goal of this benchmarking dataset?

Now that we know more about the types of datasets in the SPRAS benchmark, we can categorize it. This one uses omic data as input and uses curated data as a gold standard, but we determined the curated data to be a poor fit, so it lacks a good gold standard.