Contributor

@pgm pgm commented Jul 29, 2025

We poll for task status after we've uploaded a file. However, file uploads now take more than a minute for large files. When something goes wrong, it's hard to tell whether the task is stuck or just slow, so I'm adding progress updates like the ones we have for other "long" tasks that run inside Celery.
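
As an illustration of the pattern described above, here is a minimal sketch of a message-based progress tracker. Only the `ProgressTracker()` constructor and the `update_message` method appear in this PR; the `on_update` callback and the `history` attribute are hypothetical and stand in for whatever the real class does to publish state from a Celery task.

```python
from typing import Callable, List, Optional


class ProgressTracker:
    """Hypothetical sketch of a tracker; not the project's actual class."""

    def __init__(self, on_update: Optional[Callable[[str], None]] = None) -> None:
        # `on_update` is an assumed hook. In a Celery task it might wrap a
        # call that stores the message where the polling client can read it.
        self.on_update = on_update
        self.history: List[str] = []

    def update_message(self, message: str) -> None:
        # Record the message locally, then forward it if a hook is set.
        self.history.append(message)
        if self.on_update is not None:
            self.on_update(message)


# Usage: collect published messages in a list instead of a task backend.
published: List[str] = []
tracker = ProgressTracker(on_update=published.append)
tracker.update_message("Uploading file")
tracker.update_message("Complete")
```

Because the hook is optional, a bare `ProgressTracker()` still accepts updates but publishes them nowhere, which is one way a caller could opt out of reporting.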

@pgm pgm requested a review from jessica-cheng July 29, 2025 16:43
@@ -469,7 +470,9 @@ def validate_and_upload_dataset_files(
  )

  # TODO: Move save function to api layer. Need to make sure the db save is successful first
- save_dataset_file(dataset_id, data_dfw, value_type, filestore_location)
+ save_dataset_file(
+ dataset_id, data_dfw, value_type, filestore_location, ProgressTracker()
Contributor Author

This is a legacy code path, so I don't care that it's not reporting status. We're not supposed to use this anyway.

Contributor

@jessica-cheng jessica-cheng left a comment

This looks good to me, though I notice that ProgressTracker only updates messages in the code block for when our df_wrapper is a ParquetDataFrameWrapper. I know this is mostly what we're testing but it would be nice for completeness that it updates messages in the other code path for when our df_wrapper is DataFrame.

create_index_dataset(f, "features", pd.Index(df_wrapper.get_column_names()))
create_index_dataset(f, "samples", pd.Index(df_wrapper.get_index_names()))
progress.update_message("Complete")
Contributor

Maybe we should put this in the finally statement?

Contributor

finally is run on failure too.

Contributor

@jessica-cheng jessica-cheng Aug 4, 2025

@rcreasi That's true, it doesn't make sense for progress to be considered complete on failure. I realize we haven't been catching failures in this try block, so I recently added a handler for them.
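
To make the point in this thread concrete, here is a small sketch of how Python orders `try`/`except`/`else`/`finally`. The `write` and `report` parameters and the message strings "Failed" and "Cleanup" are hypothetical stand-ins, not names from this PR. The `else` branch runs only when no exception was raised, so it is one place a "Complete" message could live, while `finally` runs on both paths.

```python
def write_with_progress(write, report):
    """Run `write` and report the outcome; both arguments are stand-ins."""
    try:
        write()
    except Exception:
        # Failure path: report it and let the error propagate.
        report("Failed")
        raise
    else:
        # Success path only: skipped whenever `write` raised.
        report("Complete")
    finally:
        # Runs after success and after failure alike.
        report("Cleanup")


def failing_write():
    raise IOError("disk full")


ok_messages = []
write_with_progress(lambda: None, ok_messages.append)

failed_messages = []
try:
    write_with_progress(failing_write, failed_messages.append)
except IOError:
    pass  # The error still propagates after the messages are recorded.
```

Placing the "Complete" update as the last statement inside the `try` body, as the diff does, has the same effect as the `else` branch here, since it is only reached when nothing before it raised.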

Contributor Author

pgm commented Aug 1, 2025

@jessica-cheng

> I know this is mostly what we're testing but it would be nice for completeness that it updates messages in the other code path for when our df_wrapper is DataFrame

Yes, in the other code path, there's no incremental progress to report, but I can sprinkle a few updates in so we can see which stage of the process we're at.
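
A rough sketch of what stage-level updates could look like when there is no incremental progress to report. The helper `save_in_stages`, the stage labels, and the "Step i/n" message format are all assumptions made for illustration; the only call taken from this PR is `update_message`, represented here by a plain callable.

```python
def save_in_stages(stages, update_message):
    """Run each (label, step) pair in order, announcing it before it starts.

    `stages` is a list of (label, zero-argument callable) pairs, and
    `update_message` is any callable that accepts a string.
    """
    total = len(stages)
    for position, (label, step) in enumerate(stages, start=1):
        update_message(f"Step {position}/{total}: {label}")
        step()
    # Only reached if every step finished without raising.
    update_message("Complete")


# Usage with placeholder steps that just record that they ran.
messages = []
written = []
save_in_stages(
    [
        ("Writing data", lambda: written.append("data")),
        ("Writing feature index", lambda: written.append("features")),
        ("Writing sample index", lambda: written.append("samples")),
    ],
    messages.append,
)
```

If a step raises, the last message published names the stage that was running, which is the information needed to tell a stuck upload from a slow one.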

3 participants