4 changes: 2 additions & 2 deletions configs/dmmm.yaml
@@ -46,12 +46,12 @@ datasets:
 # TODO: use old parameters for datasets
 # HIV: https://github.com/Reed-CompBio/spras-benchmarking/blob/0293ae4dc0be59502fac06b42cfd9796a4b4413e/hiv-benchmarking/spras-config/config.yaml
 - label: dmmmhiv060
-  node_files: ["processed_prize_060.txt"]
+  node_files: ["processed_prizes_060.txt"]
   edge_files: ["phosphosite-irefindex13.0-uniprot.txt"]
   other_files: []
   data_dir: "datasets/hiv/processed"
 - label: dmmmhiv05
-  node_files: ["processed_prize_05.txt"]
+  node_files: ["processed_prizes_05.txt"]
   edge_files: ["phosphosite-irefindex13.0-uniprot.txt"]
   other_files: []
   data_dir: "datasets/hiv/processed"
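
As a quick sanity check (a hypothetical helper, not part of this PR), the renamed prize files can be verified against the config; this sketch assumes the top-level `datasets` key shown in the hunk above and PyYAML being available:

```python
# Hypothetical check that every node/edge file named in configs/dmmm.yaml
# actually exists under its dataset's data_dir.
from pathlib import Path

import yaml

config = yaml.safe_load(Path("configs/dmmm.yaml").read_text())
for dataset in config["datasets"]:
    data_dir = Path(dataset["data_dir"])
    for name in dataset["node_files"] + dataset["edge_files"]:
        assert (data_dir / name).exists(), f"{dataset['label']}: missing {name}"
```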
3 changes: 2 additions & 1 deletion datasets/hiv/.gitignore
@@ -1 +1,2 @@
-processed
+/processed
+/Pickles
15 changes: 15 additions & 0 deletions datasets/hiv/README.md
@@ -0,0 +1,15 @@
# HIV dataset

Collaborator:

I would find it helpful to have more of the context from https://github.com/Reed-CompBio/spras-benchmarking/blob/0293ae4dc0be59502fac06b42cfd9796a4b4413e/hiv-benchmarking/README.md. What is the overall goal of this benchmarking dataset?

Now that we know more about the types of datasets in the SPRAS benchmark, we can categorize this one: it uses omic data as input and curated data as a gold standard, but we determined the curated data to be a poor fit, so the dataset lacks a good gold standard.

## Raw files

See `raw/README.md`.

## File organization

See the `Snakefile` for how all of the I/O files are connected.

1. `fetch.py` - Downloads the score files from https://doi.org/10.1371/journal.ppat.1011492.
1. `prepare.py` - Cleans up the prize files in `raw`, specifically by removing duplicates (a rough sketch follows this list).
Collaborator:

The original readme was more detailed about these processing steps. For instance, one of the steps removed isoform identifiers, which I don't see mentioned here. My goal is to be able to visit the readme for a benchmarking dataset and write the methods section of the manuscript from that information without re-reading all the code. We'll get bogged down during writing if we have to read all the source to remember how we processed each dataset.

Contributor Author @tristan-f-r (Aug 8, 2025):

This motivation makes sense - I was very confused about who the audience for a dataset-wide README is. My main worry is that, like the other READMEs in this repository, we would end up with outdated documentation that doesn't match what actually happens during processing, so I want to ask a follow-up question (to determine how much we should worry about maintaining the README):

Would future users of spras-benchmarking also write down the methodology described in the README, or is it specifically for the first benchmarking paper?

Contributor Author @tristan-f-r (Aug 10, 2025):

(My question was answered in another comment)

1. `name_mapping.py` - Converts from UniProtKB AC/ID to UniProtKB to meet in the middle with `kegg_orthology.py`. We chose UniProtKB for its generality. (A sketch using UniProt's ID-mapping service follows this list.)
1. `spras_formatting.py` - Converts the input files into a SPRAS-ready format.
1. `kegg_orthology.py` - Generates the KEGG ortholog file for gold standards; this has yet to be finalized.
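
To make these steps concrete, here is a minimal sketch of the deduplication in `prepare.py`. It assumes the raw prize files are tab-separated with the node identifier in the first column and that the cleaned tables are pickled for the downstream rules; the actual script may differ.

```python
# Hypothetical sketch of scripts/prepare.py (not the actual script):
# drop repeated node identifiers from the raw prize tables.
import pandas as pd

frames = {}
for threshold in ("05", "060"):
    prizes = pd.read_csv(f"raw/prizes_{threshold}.tsv", sep="\t")
    # Keep one row per node identifier, preferring the first occurrence.
    prizes = prizes.drop_duplicates(subset=prizes.columns[0], keep="first")
    frames[threshold] = prizes

# The Snakefile expects the cleaned node IDs at Pickles/NodeIDs.pkl.
pd.to_pickle(frames, "Pickles/NodeIDs.pkl")
```

And a sketch of the ID translation in `name_mapping.py`, assuming it uses UniProt's public ID-mapping REST service (the real script may batch, cache, or parse differently):

```python
# Hypothetical sketch of scripts/name_mapping.py (not the actual script):
# map UniProtKB AC/ID identifiers to canonical UniProtKB accessions.
import time

import requests

def map_to_uniprotkb(ids: list[str]) -> dict[str, str]:
    job = requests.post(
        "https://rest.uniprot.org/idmapping/run",
        data={"from": "UniProtKB_AC-ID", "to": "UniProtKB", "ids": ",".join(ids)},
    ).json()
    # Poll until the mapping job finishes; the status endpoint redirects to
    # the results once the job is done.
    while True:
        status = requests.get(
            f"https://rest.uniprot.org/idmapping/status/{job['jobId']}"
        ).json()
        if status.get("jobStatus") == "FINISHED" or "results" in status:
            break
        time.sleep(1)
    results = requests.get(
        f"https://rest.uniprot.org/idmapping/results/{job['jobId']}"
    ).json()
    return {row["from"]: row["to"]["primaryAccession"] for row in results["results"]}
```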
31 changes: 0 additions & 31 deletions datasets/hiv/Scripts/Data_Prep.py

This file was deleted.

64 changes: 0 additions & 64 deletions datasets/hiv/Scripts/Kegg_Orthology.py

This file was deleted.

28 changes: 0 additions & 28 deletions datasets/hiv/Scripts/SPRAS_Formatting.py

This file was deleted.

28 changes: 18 additions & 10 deletions datasets/hiv/Snakefile
@@ -1,40 +1,48 @@
 rule all:
     input:
-        "processed/processed_prize_05.txt",
-        "processed/processed_prize_060.txt",
+        "processed/processed_prizes_05.txt",
+        "processed/processed_prizes_060.txt",
         "processed/phosphosite-irefindex13.0-uniprot.txt"
 
+rule fetch:
+    output:
+        "raw/prizes_05.tsv",
+        "raw/prizes_060.tsv",
+        "raw/ko03250.xml"
+    shell:
+        "uv run scripts/fetch.py"
+
 rule data_prep:
     input:
-        "raw/prize_05.csv",
-        "raw/prize_060.csv"
+        "raw/prizes_05.tsv",
+        "raw/prizes_060.tsv"
     output:
         "Pickles/NodeIDs.pkl"
     shell:
-        "uv run Scripts/Data_Prep.py"
+        "uv run scripts/prepare.py"
 
 rule name_mapping:
     input:
         "Pickles/NodeIDs.pkl"
     output:
         "Pickles/UniprotIDs.pkl"
     shell:
-        "uv run Scripts/Name_Mapping.py"
+        "uv run scripts/name_mapping.py"
 
 rule spras_formatting:
     input:
         "Pickles/NodeIDs.pkl",
         "Pickles/UniprotIDs.pkl"
     output:
-        "processed/processed_prize_05.txt",
-        "processed/processed_prize_060.txt"
+        "processed/processed_prizes_05.txt",
+        "processed/processed_prizes_060.txt"
     shell:
-        "uv run Scripts/SPRAS_Formatting.py"
+        "uv run scripts/spras_formatting.py"
 
 rule copy_network:
     input:
         "raw/phosphosite-irefindex13.0-uniprot.txt"
     output:
         "processed/phosphosite-irefindex13.0-uniprot.txt"
     shell:
-        "cp raw/phosphosite-irefindex13.0-uniprot.txt processed/phosphosite-irefindex13.0-uniprot.txt"
+        "cp raw/phosphosite-irefindex13.0-uniprot.txt processed/phosphosite-irefindex13.0-uniprot.txt"
3 changes: 3 additions & 0 deletions datasets/hiv/raw/.gitignore
@@ -0,0 +1,3 @@
prizes_05.tsv
prizes_060.tsv
ko03250.xml
10 changes: 10 additions & 0 deletions datasets/hiv/raw/README.md
@@ -0,0 +1,10 @@
# raw
Collaborator:

I'm missing where the rest of the input files come from. Having it in fetch.py does not tell me where the inputs for this dataset come from and what they are.

Contributor Author:

I'm a little confused about this comment (though there was some clarification missing on a paper link, which I've just committed) - do you want the docs from `fetch.py` duplicated into `raw/README.md`? The inline and top-level comments in `fetch.py` describe more than the original hiv README did, including the origin of `ko03250.xml`.

Collaborator:

We can call it duplication, but yes. I want the readme(s) to tell the story of the dataset. There will be multiple scripts needed to execute the benchmark, and I see two different goals:

  1. the code with good comments
  2. documenting the dataset globally, so that we (and potentially external readers of a manuscript) can easily go from a dataset directory to an overall understanding of how all the pieces work together and where external information comes from, and then gather that information into a manuscript.

We could go through the exercise of writing one of the paragraphs for the benchmarking paper manuscript to see what information we need to expose.

Contributor:

This is the current level of detail I was going for in the benchmarking report:

Structure this section so each dataset explains what the data is and the biological context, where all of the data came from, the sources/targets/prizes (include the biological roles of the inputs?), the IDs (explain why the translation is needed and why that ID type was chosen), the gold standard, the interactome, and the preprocessing for everything.


Some `raw` files are fetched by `../scripts/fetch.py`.

The `phosphosite-irefindex13.0-uniprot.txt` file is a background interactome provided by SPRAS: https://github.com/Reed-CompBio/spras/blob/be8bc7f8d71880d7ce9c9ceeeddfefa6eb60c522/input/phosphosite-irefindex13.0-uniprot.txt.

The `ko03250.xml` file comes from https://www.kegg.jp/entry/ko03250. Specifically, clicking the pathway image in that entry leads to https://www.kegg.jp/pathway/ko03250, where the KGML file (formatted as `.xml`) can be downloaded under `Download` -> `KGML`. (The final file is at https://www.kegg.jp/kegg-bin/download?entry=ko03250&format=kgml.)
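
As a rough illustration (not necessarily how `fetch.py` retrieves it), the KGML can also be fetched and parsed programmatically from that download URL:

```python
# Sketch: download the KGML for pathway ko03250 and list its KO entries.
import xml.etree.ElementTree as ET

import requests

KGML_URL = "https://www.kegg.jp/kegg-bin/download?entry=ko03250&format=kgml"

response = requests.get(KGML_URL, timeout=30)
response.raise_for_status()

# KGML's root element is <pathway>; <entry> elements of type "ortholog"
# carry the KO identifiers (e.g. name="ko:K12345").
root = ET.fromstring(response.text)
kos = [entry.get("name") for entry in root.iter("entry")
       if entry.get("type") == "ortholog"]
print(f"{len(kos)} ortholog entries in ko03250")
```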