dataset: DepMap #41
Merged
Commits (29):
- a08a841 Initial addition of all processed data, example config file, local sc…
- cc2309c applied changes on documentation and files based on feedback
- 4b64c64 Update datasets/depmap/README.md (gli527)
- 7373401 fix typos and formatting errors
- 04595de fix typoes and formatting errors
- 1a59454 fix typoes and formatting errors
- e7ac55f added python scripts for cell line processing
- e63e9aa updated readme for scripts
- 35d1fb9 added scripts for Uniprot mapping, updated readme and config file
- 81925ea added error handling for cell line processing (gli527)
- b332c0a Revert yaml and lint script (agitter)
- 91c8e0a Merge branch 'main' into pr/41 (tristan-f-r)
- 482815c chore: bump spras (tristan-f-r)
- 8dbdf2a Merge branch 'depmap' of https://github.com/gli527/spras-benchmarking… (tristan-f-r)
- 3c98a5c chore(spras): bump [again!] (tristan-f-r)
- d788e2b style: fmt (tristan-f-r)
- 8daf02e Merge branch 'main' into pr/41 (tristan-f-r)
- e90f384 chore: set up fetch (tristan-f-r)
- 4650a28 fix: more path changes (tristan-f-r)
- e1ff0e5 chore: drop output date from uniprot mapping (tristan-f-r)
- 5bec9c9 chore: apply suggestions (tristan-f-r)
- e6f5d53 chore: correct file names, more docs (tristan-f-r)
- ba75406 feat: map uniprot through gene ids when available (tristan-f-r)
- b19d925 fix: merge cellline_fadu config with dmmm (tristan-f-r)
- 745e41c style: fmt (tristan-f-r)
- dd4707a fix: clear up irefindex interactome dependency from hiv (tristan-f-r)
- 8987a88 chore: add depmap to run_snakemake.sh (tristan-f-r)
- c877384 fix: commas for process input (tristan-f-r)
- d946192 fix: dmmm prefix (tristan-f-r)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,2 @@ | ||
| raw | ||
| processed |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,54 @@ | ||
| # Cancer Dependency Map Dataset | ||
|
|
||
| This folder contains processed data and the scripts used to analyze and prepare datasets from the Cancer Dependency Map (DepMap), an initiative led by the Broad Institute that provides large-scale omics data for identifying cancer dependencies and vulnerabilities. | ||
|
|
||
| You can read more about DepMap and the projects included here: https://www.broadinstitute.org/cancer/cancer-dependency-map | ||
|
|
||
| ## Raw Data | ||
| You can visit the DepMap all data downloads portal at: https://depmap.org/portal/data_page/?tab=allData | ||
| Download the following datasets from the primary files section of the portal and move them into a directory named `raw` that you create. Each entry below includes the dataset description from the website: | ||
|
|
||
| Currently used files: | ||
|
|
||
| - `OmicsProfiles.csv`: Omics metadata and ID mapping information for files indexed by Profile ID. This dataset is used for mapping cell line names to DepMap model IDs as a basis for data processing. (file URL: https://depmap.org/portal/data_page/?tab=allData&releasename=DepMap%20Public%2025Q2&filename=OmicsProfiles.csv) | ||
| - `CRISPRGeneDependency.csv`: Gene dependency probability estimates for all models in the integrated gene effect. This dataset is used to identify gold standard genes in each cell line; a dependency probability cutoff of 0.5 is currently used to select genes with considerable impact on the cell line. (file URL: https://depmap.org/portal/data_page/?tab=allData&releasename=DepMap%20Public%2025Q2&filename=CRISPRGeneDependency.csv) | ||
| - `OmicsSomaticMutationsMatrixDamaging.csv`: Genotyped matrix determining for each cell line whether each gene has at least one damaging mutation. A variant is considered a damaging mutation if LikelyLoF == True. (0 == no mutation; If there is one or more damaging mutations in the same gene for the same cell line, the allele frequencies are summed, and if the sum is greater than 0.95, a value of 2 is assigned and if not, a value of 1 is assigned.). This dataset is used to prepare the input prize file. (file URL: https://depmap.org/portal/data_page/?tab=allData&releasename=DepMap%20Public%2025Q2&filename=OmicsSomaticMutationsMatrixDamaging.csv) | ||
|
|
||
| Future extension files: | ||
|
|
||
| - `OmicsExpressionProteinCodingGenesTPMLogp1.csv`: Model-level TPMs derived from Salmon v1.10.0 (Patro et al. 2017). Rows: model IDs. Columns: gene names. (file URL: https://depmap.org/portal/data_page/?tab=allData&releasename=DepMap%20Public%2025Q2&filename=OmicsExpressionProteinCodingGenesTPMLogp1.csv) | ||
| - `OmicsCNGeneWGS.csv`: Gene-level copy number data inferred from WGS data only. Additional copy number datasets are available for download as part of the full DepMap Data Release. (file URL: https://depmap.org/portal/data_page/?tab=allData&releasename=DepMap%20Public%2025Q2&filename=OmicsCNGeneWGS.csv) | ||
|
|
||
|
|
||
| ## Scripts | ||
| Currently contains: | ||
| - `local_cell_line_preprocessing.ipynb`: Jupyter notebook for exploratory data analysis and initial pipeline development. Includes CRISPR dependency analysis with multiple thresholds, visualization of gene dependency distributions, the UniProt ID mapping workflow (currently both the gene symbol and gene number approaches), and step-by-step generation of prize input files and gold standard files for individual cell lines. | ||
| - `cell_line_processing.py`: General cell line processing pipeline for generating prize input files and gold standard files, converted into a Python script. It should be reproducible for any cell line name, and could be further organized and refined. | ||
|
|
||
|
|
||
| Files used for preparing required files: | ||
| - `OmicsProfiles.csv` used for mapping cell line names to DepMap model IDs. | ||
| - `OmicsSomaticMutationsMatrixDamaging.csv` used for preparing prize input file. | ||
| - `CRISPRGeneDependency.csv` used for preparing gold standard output. | ||
|
|
||
| ## Processed Data | ||
| Files used for UniProt ID mapping: | ||
| - `DamagingMutationsGeneSymbols_20250718.csv`: Gene symbols parsed from the gene columns in `OmicsSomaticMutationsMatrixDamaging.csv` on the date in the file name. | ||
| - `DamagingMutations_idMapping_20250718.tsv`: Gene symbols from `DamagingMutationsGeneSymbols_20250718.csv` mapped to UniProt IDs using the UniProt web service on the date in the file name. | ||
| - Folder of processed data from an attempt to do the UniProt mapping with gene index numbers instead. The attempt stalled because some gene numbers had duplicate matches. A possible future step is to consult the original mutations file (`OmicsSomaticMutations.csv` on DepMap, URL: https://depmap.org/portal/data_page/?tab=allData&releasename=DepMap%20Public%2025Q2&filename=OmicsSomaticMutations.csv) for the gene numbers with duplicate matches and resolve them by where the mutation is located, to get more accurate mappings. The folder contains preliminary processed data (all as of 07/24/2025): | ||
| - `gene_index_mapping_attempt\gene_numbers.txt`: Gene index numbers parsed from the gene columns in `OmicsSomaticMutationsMatrixDamaging.csv`. | ||
| - `raw_uniprot_idmapping_2025_07_24.tsv`: Initial mapping results, containing both reviewed and unreviewed entries; they could not be filtered directly on the UniProt web service due to their volume. | ||
| - `reviewed_id_mapping_2025_07_24.tsv`: Mapping results filtered to reviewed matches only. | ||
| - `duplicated_mapping_entries.tsv`: Gene index numbers with duplicate matches. | ||
|
|
||
| Started processing with the FADU cell line: | ||
| - Input prize file prepared from the damaging mutations dataset | ||
| - Gold standard file prepared from the CRISPR gene dependency dataset | ||
|
|
||
| ## Config | ||
| Example config file used to get preliminary results with OmicsIntegrator1 and OmicsIntegrator2, following the EGFR dataset example. More parameters will be tested and the file updated accordingly. | ||
| The input edge file for the background network can be obtained from the SPRAS repo: [`input/phosphosite-irefindex13.0-uniprot.txt`](https://github.com/Reed-CompBio/spras/blob/b5d7a2499afa8eab14c60ce0f99fa7e8a23a2c64/input/phosphosite-irefindex13.0-uniprot.txt). | ||
|
|
||
| ## Release Citation | ||
| For DepMap Release data, including CRISPR Screens, PRISM Drug Screens, Copy Number, Mutation, Expression, and Fusions: | ||
| DepMap, Broad (2025). DepMap Public 25Q2. Dataset. depmap.org |
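
The README above describes two derived files per cell line: a gold standard taken from `CRISPRGeneDependency.csv` at a 0.5 cutoff, and a prize file taken from `OmicsSomaticMutationsMatrixDamaging.csv`. A minimal sketch of that step follows; it is not the PR's `cell_line_processing.py`. It assumes that gene columns are named `SYMBOL (EntrezID)`, that the cutoff is inclusive, and that the nonzero matrix value is used directly as the prize. The function names and the example rows are hypothetical.

```python
import re

# Assumption: gene columns are named "SYMBOL (EntrezID)", e.g. "TP53 (7157)".
GENE_COLUMN = re.compile(r"^(?P<symbol>\S+) \((?P<entrez>\d+)\)$")


def parse_gene_column(header):
    """Return (symbol, entrez_id) for a gene column header, or None for other columns."""
    match = GENE_COLUMN.match(header.strip())
    return (match["symbol"], match["entrez"]) if match else None


def gene_values(row):
    """Yield (symbol, float value) for every gene column of one cell line's row."""
    for header, value in row.items():
        parsed = parse_gene_column(header)
        if parsed is None or value in ("", None):
            continue  # skip ID columns and missing values
        yield parsed[0], float(value)


def gold_standard_genes(dependency_row, cutoff=0.5):
    """Genes whose dependency probability is at least `cutoff` (inclusive is an assumption)."""
    return sorted(symbol for symbol, prob in gene_values(dependency_row) if prob >= cutoff)


def mutation_prizes(mutation_row):
    """Map each damaged gene to its matrix value (1 or 2), used here as a stand-in prize."""
    return {symbol: value for symbol, value in gene_values(mutation_row) if value > 0}


# Made-up rows for illustration only; real rows would come from the DepMap CSVs.
dependency_row = {"ModelID": "ACH-EXAMPLE", "TP53 (7157)": "0.91", "A1BG (1)": "0.02"}
mutation_row = {"ModelID": "ACH-EXAMPLE", "TP53 (7157)": "2.0", "A1BG (1)": "0"}

print(gold_standard_genes(dependency_row))
print(mutation_prizes(mutation_row))
```

In practice each row could be read with `csv.DictReader` and selected by the model ID that `OmicsProfiles.csv` gives for the cell line name. The resulting gene symbols would still need the UniProt mapping described under Processed Data.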
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,91 @@ | ||
| # The length of the hash used to identify a parameter combination | ||
| hash_length: 7 | ||
|
|
||
| # If true, use Singularity instead of Docker | ||
| # Singularity support is only available on Unix | ||
| singularity: false | ||
|
|
||
| algorithms: | ||
| - name: pathlinker | ||
| params: | ||
| include: false | ||
| run1: | ||
| k: | ||
| - 10 | ||
| - 20 | ||
| - name: omicsintegrator1 | ||
| params: | ||
| include: true | ||
| run1: | ||
| b: | ||
| - 2 | ||
| - 4 | ||
| - 7 | ||
| - 10 | ||
| d: | ||
| - 10 | ||
| g: | ||
| - 1e-3 | ||
| r: | ||
| - 0.01 | ||
| w: | ||
| - 0.1 | ||
| - 1 | ||
| mu: | ||
| - 0.01 | ||
| - 0.1 | ||
| dummy_mode: ["terminal"] | ||
| - name: omicsintegrator2 | ||
| params: | ||
| include: true | ||
| run1: | ||
| b: | ||
| - 4 | ||
| g: | ||
| - 0 | ||
| run2: | ||
| b: | ||
| - 2 | ||
| g: | ||
| - 3 | ||
| - name: meo | ||
| params: | ||
| include: false | ||
| run1: | ||
| local_search: | ||
| - "Yes" | ||
| max_path_length: | ||
| - 3 | ||
| rand_restarts: | ||
| - 10 | ||
| - name: domino | ||
| params: | ||
| include: false | ||
| run1: | ||
| slice_threshold: | ||
| - 0.3 | ||
| module_threshold: | ||
| - 0.05 | ||
| datasets: | ||
| - data_dir: input | ||
| edge_files: | ||
| - phosphosite-irefindex13.0-uniprot.txt | ||
|
||
| label: cellline | ||
| node_files: | ||
| - cellline_fadu_nodes.txt | ||
| other_files: [] | ||
| reconstruction_settings: | ||
|
||
| locations: | ||
| reconstruction_dir: output/cellline_fadu | ||
| run: true | ||
| analysis: | ||
| graphspace: | ||
| include: false | ||
| cytoscape: | ||
| include: true | ||
| summary: | ||
| include: true | ||
| ml: | ||
| include: false | ||
| evaluation: | ||
| include: false | ||
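
The config above lists several values for some parameters. Assuming SPRAS expands each `run` block into every combination of its listed values, and that `hash_length` sets how many characters identify one combination, the sketch below shows how many OmicsIntegrator1 runs `run1` would produce. The hashing here (SHA-1 over sorted JSON) is a stand-in chosen for illustration, not SPRAS's own scheme, and `expand_run` and `short_hash` are hypothetical names.

```python
import hashlib
import itertools
import json

# Parameter lists copied from the omicsintegrator1 `run1` block in the config above,
# written as Python values (1e-3 is entered here as a float).
OI1_RUN1 = {
    "b": [2, 4, 7, 10],
    "d": [10],
    "g": [1e-3],
    "r": [0.01],
    "w": [0.1, 1],
    "mu": [0.01, 0.1],
    "dummy_mode": ["terminal"],
}


def expand_run(params):
    """Return one dict per combination of the listed parameter values."""
    keys = list(params)
    value_lists = (params[key] for key in keys)
    return [dict(zip(keys, values)) for values in itertools.product(*value_lists)]


def short_hash(combo, length=7):
    """Illustrative identifier for one combination; not SPRAS's actual hashing."""
    encoded = json.dumps(combo, sort_keys=True).encode("utf-8")
    return hashlib.sha1(encoded).hexdigest()[:length]


combos = expand_run(OI1_RUN1)
print(len(combos))  # 4 values of b * 2 of w * 2 of mu = 16
```

Under the same assumption, the two OmicsIntegrator2 runs each list a single value per parameter, so they would contribute one combination each.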