Update long reads to get new datasets #388
Conversation
if "IsDefaultEntryForModel" not in df.columns: | ||
print(f"No IsDefaultEntryForModel column found in {dataset_id}, skipping preprocessing") | ||
print( | ||
f"No IsDefaultEntryForModel column found in {dataset_id}, skipping preprocessing" | ||
) | ||
return df | ||
|
||
print(f"Preprocessing {dataset_id}...") | ||
print("Filtering to default entries per model...") | ||
filtered_df = df[df["IsDefaultEntryForModel"] == "Yes"].copy() | ||
|
||
dataset_name = dataset_id.split("/")[-1] | ||
if dataset_name in ["OmicsFusionFiltered", "OmicsProfiles", "OmicsSomaticMutations"]: | ||
if dataset_name in [ | ||
"OmicsFusionFiltered", | ||
"OmicsProfiles", | ||
"OmicsSomaticMutations", | ||
]: | ||
print(f"Warning: {dataset_id} has multiple entries per ModelID") | ||
else: | ||
assert not filtered_df["ModelID"].duplicated().any(), f"Duplicate ModelID after filtering in {dataset_id}" | ||
assert ( | ||
not filtered_df["ModelID"].duplicated().any() | ||
), f"Duplicate ModelID after filtering in {dataset_id}" | ||
print("Setting ModelID as index...") |
Note: These are not my changes. Just a forced reformat.
I wish "hide whitespace" were on by default. It makes this kind of reformat-only change vanish from the diff (but you have to explicitly turn it on every time you view one).
 long_reads_summary = None
-if depmap_long_reads_gcloud_loc is not None:
+if len(depmap_long_reads_taiga_ids) > 0:
@snwessel This should skip adding the data to the plot for the public environment. We should double-check this once public is staged.
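A minimal sketch of how that guard could play out, assuming depmap_long_reads_taiga_ids is an empty list in the public environment (build_long_reads_summary and add_to_availability_plot are hypothetical stand-ins, not names from this PR):

    # Sketch: skip the long-reads summary when no taiga IDs are configured,
    # e.g. in the public environment.
    long_reads_summary = None
    if len(depmap_long_reads_taiga_ids) > 0:
        long_reads_summary = build_long_reads_summary(depmap_long_reads_taiga_ids)  # hypothetical helper

    # Downstream plotting can then skip the long-reads entry entirely:
    if long_reads_summary is not None:
        add_to_availability_plot(long_reads_summary)  # hypothetical helper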
@@ -5,6 +5,7 @@
 rule get_all_data_availability:
     inputs:
         artifacts=all {"type" ~ "depmap_data_taiga_id|depmap_oncref_taiga_id|rna_merged_version_taiga_id|rnai_drive_taiga_id|repurposing_matrix_taiga_id|ctd2-drug-taiga-id|gdsc_drug_taiga_id|raw-rppa-matrix|proteomics-raw|sanger_methylation_taiga_id|biomarker-correctly-transposed|ccle_mirna_taiga_id|ataq_seq_taiga_id|olink_taiga_id|sanger-proteomics|depmap_paralogs_taiga_id|depmap_long_reads_gcloud_loc"},
+        depmap_long_reads_datasets=all {'type': 'depmap_long_reads_dataset'},
@snwessel I'm a little unsure of whether I used "all" correctly here. There should be 4 artifacts of type "depmap_long_reads_dataset", and I want to get them all as a list to use in the get_all_data_availability.py script. If you have issues running this, this will be the first place to check.
Okay sounds good, thank you! I'll keep an eye out for that.
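For reference, a rough sketch of how the collected artifacts might be consumed in get_all_data_availability.py, assuming conseq passes the all {...} match through as a list of dict-like artifacts (the "dataset_id" key is an assumption, not confirmed by this PR):

    # Hypothetical consumer of the depmap_long_reads_datasets input:
    # pull a taiga ID off each of the (expected four) artifacts.
    def collect_long_reads_taiga_ids(depmap_long_reads_datasets):
        # The key name "dataset_id" is a guess; it depends on how the
        # depmap_long_reads_dataset artifacts were registered.
        return [artifact["dataset_id"] for artifact in depmap_long_reads_datasets]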
Assuming that depmap_long_reads_dataset won't be available in the public release, this conseq rule won't run since depmap_long_reads_datasets will result in an empty dict.
We can add this to artifacts and refactor the code accordingly to make it work.
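A sketch of that suggested refactor, assuming each artifact dict exposes its conseq "type" field (key names are assumptions):

    # Hypothetical refactor: fold depmap_long_reads_dataset into the single
    # artifacts input, then split the long-reads artifacts back out in Python.
    # The rule still runs when no long-reads artifacts exist (public release);
    # the split list is simply empty.
    def split_long_reads_artifacts(artifacts):
        long_reads = [a for a in artifacts if a.get("type") == "depmap_long_reads_dataset"]
        others = [a for a in artifacts if a.get("type") != "depmap_long_reads_dataset"]
        return long_reads, others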
@@ -66,29 +67,37 @@ def preprocess_omics_dataframe(df, dataset_id):
     4. Set ModelID as index
     5. Drop columns with all NaN values
     """
@snwessel The Long Reads files now have this IsDefaultEntryForModel column. This makes me wonder if these datasets need to go through this preprocess_omics_dataframe step. This is something you should probably double-check with Phil before deploying these changes to prod.
Oh right, that's a good point, thank you! In that case, it might be good to get @naquib314's review on this too when he's back on Tuesday - he's probably the most familiar with how these new omics indices are handled here.
assert "ACHID" in df.columns, f"Column 'ACHID' not found in file {file_name}" | ||
unique_model_ids.update(df["ACHID"].unique()) | ||
for taiga_id in taiga_ids: | ||
df = tc.get(taiga_id) |
@snwessel If the datasets do need to use preprocess_omics_dataframe, it should be used here: preprocess_omics_dataframe(df, taiga_id)
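Concretely, a sketch of where that call would slot into the new loop (tc is assumed to be a taigapy client, per the diff above):

    for taiga_id in taiga_ids:
        df = tc.get(taiga_id)
        # The call suggested above, applied before model IDs are collected.
        df = preprocess_omics_dataframe(df, taiga_id)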
These changes look good to me 👍 but I agree that it might make sense to wait to merge this until we can get Nayeem and/or Phil's input - I'll follow up with them on Tuesday
I will refactor and push.
Thank you for the updates @naquib314! I am perhaps not the best person to review pipeline changes, but overall these look good to me 👍 If you feel comfortable merging them in, feel free to do so and I can re-run the pipeline now :) or if you'd like a reviewer, we can wait until Ali is back tomorrow - either way is fine by me.
Cool! The changes should work and we can get an earlier pipeline run. I will merge it in.
Corresponding depmap-deploy PR: https://github.com/broadinstitute/depmap-deploy/pull/99