Skip to content

Conversation

alimourey
Copy link
Contributor

@alimourey alimourey commented Aug 29, 2025

Comment on lines 72 to 93
if "IsDefaultEntryForModel" not in df.columns:
print(f"No IsDefaultEntryForModel column found in {dataset_id}, skipping preprocessing")
print(
f"No IsDefaultEntryForModel column found in {dataset_id}, skipping preprocessing"
)
return df

print(f"Preprocessing {dataset_id}...")
print("Filtering to default entries per model...")
filtered_df = df[df["IsDefaultEntryForModel"] == "Yes"].copy()

dataset_name = dataset_id.split("/")[-1]
if dataset_name in ["OmicsFusionFiltered", "OmicsProfiles", "OmicsSomaticMutations"]:
if dataset_name in [
"OmicsFusionFiltered",
"OmicsProfiles",
"OmicsSomaticMutations",
]:
print(f"Warning: {dataset_id} has multiple entries per ModelID")
else:
assert not filtered_df["ModelID"].duplicated().any(), f"Duplicate ModelID after filtering in {dataset_id}"
assert (
not filtered_df["ModelID"].duplicated().any()
), f"Duplicate ModelID after filtering in {dataset_id}"
print("Setting ModelID as index...")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: These are not my changes. Just a forced reformat.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish hide whitespace were on by default. It makes this kind of thing vanish from the diff (but you have to explicitly turn it on every time you view one).

long_reads_summary = None
if depmap_long_reads_gcloud_loc is not None:
if len(depmap_long_reads_taiga_ids) > 0:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@snwessel This should skip adding the data to the plot for the public environment. We should double check this once public is staged

@@ -5,6 +5,7 @@
rule get_all_data_availability:
inputs:
artifacts=all {"type" ~ "depmap_data_taiga_id|depmap_oncref_taiga_id|rna_merged_version_taiga_id|rnai_drive_taiga_id|repurposing_matrix_taiga_id|ctd2-drug-taiga-id|gdsc_drug_taiga_id|raw-rppa-matrix|proteomics-raw|sanger_methylation_taiga_id|biomarker-correctly-transposed|ccle_mirna_taiga_id|ataq_seq_taiga_id|olink_taiga_id|sanger-proteomics|depmap_paralogs_taiga_id|depmap_long_reads_gcloud_loc"},
depmap_long_reads_datasets = all {'type':'depmap_long_reads_dataset'},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@snwessel I'm a little unsure of whether I used "all" correctly here. There should be 4 artifacts of type "depmap_long_reads_dataset", and I want to get them all as a list to use in the get_all_data_availability.py script. If you have issues running this, this will be the first place to check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay sounds good, thank you! I'll keep an eye out for that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming that depmap_long_reads_dataset won't be available in the public release, this conseq rule won't run since depmap_long_reads_datasets will result in an empty dict.

We can add this to artifacts and refactor the code accordingly to make it work.

@@ -66,29 +67,37 @@ def preprocess_omics_dataframe(df, dataset_id):
4. Set ModelID as index
5. Drop columns with all NaN values
"""
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@snwessel The Long Reads files now have this IsDefaulyEntryForModel file. This makes me wonder if the datasets need to go through this preprocess_omics_dataframe step. This is something you should probably double check with Phil before deploying these changes to prod.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right, that's a good point, thank you! In that case, it might be good to get @naquib314 's review on this too when he's back on Tuesday - he's probably the most familiar with how these new omics indices are handled here

assert "ACHID" in df.columns, f"Column 'ACHID' not found in file {file_name}"
unique_model_ids.update(df["ACHID"].unique())
for taiga_id in taiga_ids:
df = tc.get(taiga_id)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@snwessel If the datasets do need to use preprocess_omics_dataframe, it should be used here: preprocess_omics_dataframe(df, taiga_id)

@alimourey alimourey requested a review from snwessel August 29, 2025 17:06
@snwessel snwessel requested a review from naquib314 August 29, 2025 17:20
Copy link
Member

@snwessel snwessel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look good to me 👍 but I agree that it might make sense to wait to merge this until we can get Nayeem and/or Phil's input - I'll follow up with them on Tuesday

Copy link
Contributor

@naquib314 naquib314 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will refactor and push

@@ -5,6 +5,7 @@
rule get_all_data_availability:
inputs:
artifacts=all {"type" ~ "depmap_data_taiga_id|depmap_oncref_taiga_id|rna_merged_version_taiga_id|rnai_drive_taiga_id|repurposing_matrix_taiga_id|ctd2-drug-taiga-id|gdsc_drug_taiga_id|raw-rppa-matrix|proteomics-raw|sanger_methylation_taiga_id|biomarker-correctly-transposed|ccle_mirna_taiga_id|ataq_seq_taiga_id|olink_taiga_id|sanger-proteomics|depmap_paralogs_taiga_id|depmap_long_reads_gcloud_loc"},
depmap_long_reads_datasets = all {'type':'depmap_long_reads_dataset'},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming that depmap_long_reads_dataset won't be available in the public release, this conseq rule won't run since depmap_long_reads_datasets will result in an empty dict.

We can add this to artifacts and refactor the code accordingly to make it work.

@naquib314 naquib314 requested a review from snwessel September 2, 2025 19:27
@snwessel
Copy link
Member

snwessel commented Sep 2, 2025

Thank you for the updates @naquib314 ! I am perhaps not the best person to review pipeline changes, but overall these look good to me 👍

If you feel comfortable merging them in, feel free to do so and I can re-run the pipeline now :) or if you'd like a reviewer, we can wait until Ali is back tomorrow - either way is fine by me

@naquib314
Copy link
Contributor

Cool! The changes should work and we can get an earlier pipeline run. I will merge it in.

@naquib314 naquib314 merged commit 8faabc5 into release-25q3 Sep 2, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants