Update long reads to get new datasets #388
Conversation
if "IsDefaultEntryForModel" not in df.columns: | ||
print(f"No IsDefaultEntryForModel column found in {dataset_id}, skipping preprocessing") | ||
print( | ||
f"No IsDefaultEntryForModel column found in {dataset_id}, skipping preprocessing" | ||
) | ||
return df | ||
|
||
print(f"Preprocessing {dataset_id}...") | ||
print("Filtering to default entries per model...") | ||
filtered_df = df[df["IsDefaultEntryForModel"] == "Yes"].copy() | ||
|
||
dataset_name = dataset_id.split("/")[-1] | ||
if dataset_name in ["OmicsFusionFiltered", "OmicsProfiles", "OmicsSomaticMutations"]: | ||
if dataset_name in [ | ||
"OmicsFusionFiltered", | ||
"OmicsProfiles", | ||
"OmicsSomaticMutations", | ||
]: | ||
print(f"Warning: {dataset_id} has multiple entries per ModelID") | ||
else: | ||
assert not filtered_df["ModelID"].duplicated().any(), f"Duplicate ModelID after filtering in {dataset_id}" | ||
assert ( | ||
not filtered_df["ModelID"].duplicated().any() | ||
), f"Duplicate ModelID after filtering in {dataset_id}" | ||
print("Setting ModelID as index...") |
Note: These are not my changes. Just a forced reformat.
I wish "hide whitespace" were on by default. It makes this kind of reformat-only change vanish from the diff (but you have to explicitly turn it on every time you view one).
 long_reads_summary = None
-if depmap_long_reads_gcloud_loc is not None:
+if len(depmap_long_reads_taiga_ids) > 0:
@snwessel This should skip adding the data to the plot for the public environment. We should double-check this once public is staged.
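A minimal sketch of how that guard could play out, assuming depmap_long_reads_taiga_ids is an empty list in the public environment (build_long_reads_summary and add_to_availability_plot are hypothetical stand-ins, not names from this PR):

    # Sketch: skip the long-reads summary when no taiga IDs are configured,
    # e.g. in the public environment.
    long_reads_summary = None
    if len(depmap_long_reads_taiga_ids) > 0:
        long_reads_summary = build_long_reads_summary(depmap_long_reads_taiga_ids)  # hypothetical helper

    # Downstream plotting can then skip the long-reads entry entirely:
    if long_reads_summary is not None:
        add_to_availability_plot(long_reads_summary)  # hypothetical helper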
@@ -5,6 +5,7 @@
 rule get_all_data_availability:
     inputs:
         artifacts=all {"type" ~ "depmap_data_taiga_id|depmap_oncref_taiga_id|rna_merged_version_taiga_id|rnai_drive_taiga_id|repurposing_matrix_taiga_id|ctd2-drug-taiga-id|gdsc_drug_taiga_id|raw-rppa-matrix|proteomics-raw|sanger_methylation_taiga_id|biomarker-correctly-transposed|ccle_mirna_taiga_id|ataq_seq_taiga_id|olink_taiga_id|sanger-proteomics|depmap_paralogs_taiga_id|depmap_long_reads_gcloud_loc"},
+        depmap_long_reads_datasets=all {'type': 'depmap_long_reads_dataset'},
@snwessel I'm a little unsure of whether I used "all" correctly here. There should be 4 artifacts of type "depmap_long_reads_dataset", and I want to get them all as a list to use in the get_all_data_availability.py script. If you have issues running this, this will be the first place to check.
Okay sounds good, thank you! I'll keep an eye out for that.
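For reference, a rough sketch of how the collected artifacts might be consumed in get_all_data_availability.py, assuming conseq passes the all {...} match through as a list of dict-like artifacts (the "dataset_id" key is an assumption, not confirmed by this PR):

    # Hypothetical consumer of the depmap_long_reads_datasets input:
    # pull a taiga ID off each of the (expected four) artifacts.
    def collect_long_reads_taiga_ids(depmap_long_reads_datasets):
        # The key name "dataset_id" is a guess; it depends on how the
        # depmap_long_reads_dataset artifacts were registered.
        return [artifact["dataset_id"] for artifact in depmap_long_reads_datasets]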
Assuming that depmap_long_reads_dataset won't be available in the public release, this conseq rule won't run since depmap_long_reads_datasets will result in an empty dict.
We can add this to artifacts and refactor the code accordingly to make it work.
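A sketch of that suggested refactor, assuming each artifact dict exposes its conseq "type" field (key names are assumptions):

    # Hypothetical refactor: fold depmap_long_reads_dataset into the single
    # artifacts input, then split the long-reads artifacts back out in Python.
    # The rule still runs when no long-reads artifacts exist (public release);
    # the split list is simply empty.
    def split_long_reads_artifacts(artifacts):
        long_reads = [a for a in artifacts if a.get("type") == "depmap_long_reads_dataset"]
        others = [a for a in artifacts if a.get("type") != "depmap_long_reads_dataset"]
        return long_reads, others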
@@ -66,29 +67,37 @@ def preprocess_omics_dataframe(df, dataset_id):
     4. Set ModelID as index
     5. Drop columns with all NaN values
     """
@snwessel The Long Reads files now have this IsDefaultEntryForModel column. This makes me wonder if these datasets need to go through this preprocess_omics_dataframe step. This is something you should probably double-check with Phil before deploying these changes to prod.
Oh right, that's a good point, thank you! In that case, it might be good to get @naquib314's review on this too when he's back on Tuesday - he's probably the most familiar with how these new omics indices are handled here.
assert "ACHID" in df.columns, f"Column 'ACHID' not found in file {file_name}" | ||
unique_model_ids.update(df["ACHID"].unique()) | ||
for taiga_id in taiga_ids: | ||
df = tc.get(taiga_id) |
@snwessel If the datasets do need to use preprocess_omics_dataframe, it should be used here: preprocess_omics_dataframe(df, taiga_id)
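Concretely, a sketch of where that call would slot into the new loop (tc is assumed to be a taigapy client, per the diff above):

    for taiga_id in taiga_ids:
        df = tc.get(taiga_id)
        # The call suggested above, applied before model IDs are collected.
        df = preprocess_omics_dataframe(df, taiga_id)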
These changes look good to me 👍 but I agree that it might make sense to wait to merge this until we can get Nayeem and/or Phil's input - I'll follow up with them on Tuesday
I will refactor and push.
Thank you for the updates @naquib314! I am perhaps not the best person to review pipeline changes, but overall these look good to me 👍 If you feel comfortable merging them in, feel free to do so and I can re-run the pipeline now :) or if you'd like a reviewer, we can wait until Ali is back tomorrow - either way is fine by me.
Cool! The changes should work and we can get an earlier pipeline run. I will merge it in.
Corresponding depmap-deploy PR: https://github.com/broadinstitute/depmap-deploy/pull/99