dataset: DepMap #41
…ript, and README file describing each folder and raw data download directions
Great to see another dataset! There do seem to be some extra files left over in gene_index_mapping_attempt, and I would love to see this pipeline represented as a series of Python scripts instead of a Jupyter notebook. That way we can always reproduce this on CI and avoid the script decay issues we hit in the HIV, yeast-osmotic-stress, and responsenet datasets.
We're working on this as well in #39. Unfortunately, we don't have any strong examples of reproducibility yet. #25 is the closest, but it's locked under a PR and isn't particularly strong in documentation.
agitter
left a comment
I would love to see this pipeline represented as a series of Python scripts instead of a Jupyter notebook, that way we are always able to reproduce this on CI
My guidance was to first focus on transferring private files to the public repo and documenting what they do (absolutely essential) and second to work on reproducibility. The first needs to happen before @gli527's summer project ends. The second will ideally make some progress before that as well, but if it doesn't, others on the SPRAS team can contribute to it afterward.
I'm leaving initial comments on the overall structure but didn't review the notebook closely yet.
datasets/depmap/processed/gene_index_mapping_attempt/gene_numbers.txt
I fixed some files and scripts based on feedback but wasn't able to review all my scripts and fix the automatic changes. I will keep adding updates; I just wanted to give a status update.
Co-authored-by: Anthony Gitter <[email protected]>
We are close to having an initial version ready to merge. My goal for the first pull request is to have a notebook and README that document everything that was done with this dataset, as well as a preliminary script that runs a parallel, automated version of that analysis. The DepMap dataset is still a work in progress. Over the following months, one of us will continue to update the notebook to explore how to include additional cell lines and datasets from DepMap. Once we finalize those decisions, we can update the script accordingly. At the very end, we can decide how to deprecate or archive the notebook and have the script reproduce everything. There are still a few comments to resolve about YAML formatting (I will fix those if it is tricky; let me know) and about raising errors in the script.
The YAML changes can all be reverted - one looks a little nicer, but yamlfmt wasn't doing that automatically. (Perhaps it was VSCode's yaml formatter?) It would be ideal if we could stop committing large processed files (I've already enabled squash merging only as the git history was starting to climb into the megabytes) - since the scripts here do work, we can push some commits to Snakemakeify them and remove the processed files.
We're lacking a working scratch space for new datasets. Once a dataset is stable, we don't need large processed files in the repo and they waste space. However, when a new dataset is being explored, I find it helpful to have intermediate outputs to understand the dataset and review code outputs. That could happen in yet another repo, but that risks scattering our work even further. I'm open to suggestions.
Keeping intermediary files in PRs is perfectly fine since the intermediate commits don't end up in the main branch history. We can also keep working in PRs as long as anyone who wants to work on a dataset has write access to that branch (or opens a superseding PR, as with #25).
Was that a configuration change made at some point during the summer? I see that the default is to squash and merge but don't remember that always being the case.
Co-Authored-By: Anthony Gitter <[email protected]>
i need to change this to dmm
tristan-f-r
left a comment
All processing happens automatically and Gene IDs are preferred in mapping.
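As a rough illustration of what "Gene IDs are preferred in mapping" could look like, here is a minimal Python sketch. The lookup tables, the `map_gene` name, and the node labels are hypothetical placeholders, not the actual DepMap script or any SPRAS API. It assumes the mapping tries the gene ID first, falls back to the gene symbol only when the ID is missing or unmapped, and raises an error when neither matches.

```python
# Hypothetical lookup tables for illustration only; the real mapping files
# used by the DepMap scripts are not shown in this thread.
ID_TO_NODE = {"1001": "node_A", "1002": "node_B"}
SYMBOL_TO_NODE = {"GENE_A": "node_A", "GENE_C": "node_C"}


def map_gene(gene_id, symbol):
    """Map one gene to a network node, preferring the gene ID over the symbol.

    Returns a (node, source) tuple, where source records which key matched.
    Raises KeyError when neither the ID nor the symbol is known.
    """
    # First choice: the gene ID, when present and mapped.
    if gene_id is not None and gene_id in ID_TO_NODE:
        return ID_TO_NODE[gene_id], "gene_id"
    # Fallback: the gene symbol.
    if symbol is not None and symbol in SYMBOL_TO_NODE:
        return SYMBOL_TO_NODE[symbol], "symbol"
    # Fail loudly instead of silently dropping the gene.
    raise KeyError(f"no mapping for gene_id={gene_id!r}, symbol={symbol!r}")
```

Returning the matched source alongside the node would make it easy to count how often the symbol fallback is used, and raising on unmapped genes is one way to address the open review comment about raising errors in the script.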
initial addition of the DepMap dataset development for SPRAS so far, will add more changes