Conversation

@gli527 (Contributor) commented on Jul 30, 2025

Initial addition of the DepMap dataset development for SPRAS so far; more changes to come.

…ript, and README file describing each folder and raw data download directions
@tristan-f-r added the "dataset" label (Mutating datasets in any way) on Jul 30, 2025
@tristan-f-r (Contributor) left a comment

Great to see another dataset! There seem to be some extra files left over in gene_index_mapping_attempt, and I would love to see this pipeline represented as a series of Python scripts instead of a Jupyter notebook, so that we can always reproduce it on CI and avoid the script decay issues we ran into with the HIV, yeast-osmotic-stress, and responsenet datasets.

We're working on this as well in #39 - we don't have any strong examples of reproducibility yet, unfortunately. #25 is the closest to this, but it's locked under a PR and isn't particularly strong in documentation.
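
As a rough illustration of the notebook-to-scripts idea (everything below is hypothetical: the file paths, column names, and threshold are placeholders, not files from this PR), each notebook step could become an importable function with a small CLI entry point that a CI job can invoke on a small test input:

```python
"""Hypothetical sketch: one DepMap processing step refactored from a notebook cell
into a script that CI can run. Paths, column names, and the threshold are placeholders."""
import argparse

import pandas as pd


def filter_dependency_scores(raw: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Keep gene/cell-line rows whose dependency score falls below the threshold."""
    return raw[raw["dependency_score"] < threshold]


def main() -> None:
    parser = argparse.ArgumentParser(description="Process a raw DepMap download.")
    parser.add_argument("--input", required=True, help="Raw DepMap CSV")
    parser.add_argument("--output", required=True, help="Filtered CSV to write")
    parser.add_argument("--threshold", type=float, default=-0.5)
    args = parser.parse_args()

    raw = pd.read_csv(args.input)
    filter_dependency_scores(raw, args.threshold).to_csv(args.output, index=False)


if __name__ == "__main__":
    main()
```

A CI workflow could then run such a script on a small checked-in test input and compare the output against an expected file, which is what would catch the script decay described above.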

@agitter (Collaborator) left a comment

> I would love to see this pipeline represented as a series of Python scripts instead of a Jupyter notebook, so that we can always reproduce it on CI

My guidance was to first focus on transferring private files to the public repo and documenting what they do (absolutely essential), and second to work on reproducibility. The first needs to happen before @gli527's summer project ends. The second will ideally make some progress before then as well, but if it doesn't, others on the SPRAS team can contribute to it afterward.

I'm leaving initial comments on the overall structure but haven't reviewed the notebook closely yet.

@gli527 (Contributor, Author) commented on Jul 30, 2025

I fixed some files and scripts based on the feedback, but I wasn't able to review all of my scripts and fix the automatic changes yet. I will keep adding updates; I just wanted to share my progress.

@agitter (Collaborator) commented on Aug 5, 2025

We are close to having an initial version ready to merge. My goal for the first pull request is to have a notebook and readme that documents everything that was done with this dataset as well as a preliminary script that conducts a parallel automated version of that analysis. The DepMap dataset is still a work in progress. Over the following months, one of us will continue to update the notebook to explore how to include additional cell lines and datasets from DepMap. Once we finalize those decisions, we can update the script accordingly. At the very end, we can decide how to deprecate or archive the notebook and have the script reproduce everything.

There are still a few comments to resolve about YAML formatting (let me know if that is tricky and I will fix it) and about raising errors in the script.

@tristan-f-r (Contributor) commented on Aug 5, 2025

The YAML changes can all be reverted: one looks a little nicer, but yamlfmt wasn't doing that automatically. (Perhaps it was VSCode's YAML formatter?)

It would be ideal if we could stop committing large processed files (I've already enabled squash merging only, as the git history was starting to climb into the megabytes). Since the scripts here do work, we can push some commits to Snakemakeify them and remove the processed files.

@agitter (Collaborator) commented on Aug 8, 2025

> It would be ideal if we could stop committing large processed files

We're lacking a working scratch space for new datasets. Once a dataset is stable, we don't need large processed files in the repo and they waste space. However, when a new dataset is being explored, I find it helpful to have intermediate outputs to understand the dataset and review code outputs. That could happen in yet another repo, but that risks scattering our work even further. I'm open to suggestions.

@tristan-f-r (Contributor) commented

Keeping intermediary files in PRs is perfectly fine since the intermediate commits don't end up in the main branch history. We can also keep working in PRs as long as anyone who wants to work on a dataset has write access to that branch (or opens a superseding PR, like with #25).

@agitter (Collaborator) commented on Aug 23, 2025

> Keeping intermediary files in PRs is perfectly fine since the intermediate commits don't end up in the main branch history.

Was that a configuration change made at some point during the summer? I see that the default is to squash and merge but don't remember that always being the case.

@tristan-f-r mentioned this pull request on Dec 26, 2025
@tristan-f-r (Contributor) left a comment

All processing happens automatically, and Gene IDs are preferred in mapping.
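
For readers outside the PR, a minimal sketch of what "preferring Gene IDs in mapping" could look like with pandas (the column names and example values are assumptions, not taken from this PR's actual files):

```python
"""Illustrative sketch: when a gene has both a symbol and a numeric Gene ID,
prefer the Gene ID and fall back to the symbol only when the ID is missing.
Column names and example values are hypothetical."""
import pandas as pd


def map_to_preferred_ids(genes: pd.DataFrame) -> pd.Series:
    """Return one identifier per row, preferring 'entrez_id' over 'symbol'."""
    preferred = genes["entrez_id"].astype("Int64").astype("string")
    return preferred.fillna(genes["symbol"])


if __name__ == "__main__":
    demo = pd.DataFrame(
        {"symbol": ["TP53", "BRAF", "NOVELGENE"], "entrez_id": [7157, 673, None]}
    )
    print(map_to_preferred_ids(demo))
```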

@tristan-f-r merged commit 487f68d into Reed-CompBio:main on Dec 30, 2025
3 checks passed