
Conversation

snwessel (Member) commented Apr 7, 2025

Asana

You can try this out yourself here: https://dev.cds.team/depmap-test-perf/breadbox/elara/

A full page load with pre-cached identifiers now takes ~2 seconds, even for a large gene-indexed dataset (here).

In comparison:

  • When we started making these performance improvements, this same page load in Elara was taking about 6 seconds.
  • In non-Elara DE2 (here), this same page load is still taking about 11 seconds.

So in total, this is MUCH improved!

But it does also add to the complexity of this endpoint.
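
For reference, here is a minimal sketch of the caching flow this change relies on (hypothetical helpers and payload, not the actual Breadbox code): the endpoint computes an ETag from the inputs that determine the response, returns 304 Not Modified when the client's If-None-Match header already matches, and otherwise returns the full payload with the ETag attached.

import hashlib
from typing import Optional
from fastapi import FastAPI, Header, Response
from fastapi.responses import JSONResponse

app = FastAPI()

def compute_etag(name: str) -> str:
    # Hypothetical stand-in: in practice the hash covers everything that
    # determines the response (e.g. dimension type name + visible dataset ids).
    return hashlib.sha256(name.encode()).hexdigest()

def load_identifiers(name: str) -> list[dict]:
    # Hypothetical stand-in for the expensive lookup being cached.
    return [{"id": "example-id", "label": "example-label"}]

@app.get("/types/dimensions/{name}/identifiers")
def get_identifiers(name: str, if_none_match: Optional[str] = Header(None)):
    etag = compute_etag(name)
    if if_none_match == etag:
        # Client already has this version cached; skip the expensive work.
        return Response(status_code=304, headers={"ETag": etag})
    return JSONResponse(content=load_identifiers(name), headers={"ETag": etag})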

for id, label in dimension_ids_and_labels.items()
]
# If the client already has a cached version of the data, exit early
if if_none_match and if_none_match[0] == etag:
Contributor commented:

what do you think about something like:

def get_etag_handler(etag, 
  generate_response, 
  if_none_match: Optional[List[str]] = Header(None)):

  if if_none_match and if_none_match[0] == etag:
    return get_not_modified_response(etag)

  result = generate_response()
  return get_response_with_etag(content=result, etag=etag)
  
def get_dimension_type_identifiers(
    request: Request,
    name: str,
    data_type: Annotated[Union[str, None], Query()] = None,
    show_only_dimensions_in_datasets: Annotated[bool, Query()] = False,
    limit: Annotated[Union[int, None], Query()] = None,
    db: SessionWithUser = Depends(get_db_with_user),
    etag_handler = Depends(get_etag_handler)):
...
   etag = hash_id_list([name] + (sorted(filtered_dataset_ids) if filtered_dataset_ids else []))

   def _get_response():
       if filtered_dataset_ids is None:
           dataset_ids_without_metadata = None
       else:
           ...
       return result

   return etag_handler(etag, _get_response)

I'm trying to lift as much of the etag handling out of the endpoint as possible. The version above gets us to the least amount of boilerplate and a pattern that we could copy for other endpoints.
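
Purely as a reference sketch (one assumption about how such a dependency might be wired so that FastAPI only resolves the header, with the endpoint supplying the etag and a lazy response generator; hypothetical endpoint and names, not the proposal verbatim):

from typing import Callable, List, Optional
from fastapi import Depends, FastAPI, Header, Response
from fastapi.responses import JSONResponse

app = FastAPI()

def get_etag_handler(if_none_match: Optional[List[str]] = Header(None)) -> Callable:
    # The dependency only resolves the If-None-Match header and returns a
    # callable; the endpoint passes in the etag and a function that builds
    # the (expensive) response body only when it is actually needed.
    def handler(etag: str, generate_response: Callable[[], object]):
        if if_none_match and if_none_match[0] == etag:
            return Response(status_code=304, headers={"ETag": etag})
        return JSONResponse(content=generate_response(), headers={"ETag": etag})
    return handler

@app.get("/example")  # hypothetical endpoint, for illustration only
def example(etag_handler: Callable = Depends(get_etag_handler)):
    etag = "abc123"  # in practice, a hash of the inputs that determine the response
    return etag_handler(etag, lambda: {"hello": "world"})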

snwessel (Member, Author) replied:

I actually had an earlier version of this branch working with a very similar setup to what you're suggesting (you can see it here), and I'd be happy to move back to that if you think it's easier to reuse for other endpoints. The reason I moved away from that approach was that it involved more boilerplate than you have above.

I believe the implementation you have above creates some issues with variable scoping when the function _get_response is actually called. For example, the internals of _get_response wouldn't have access to variables like db or filtered_dataset_ids unless they're explicitly passed in by defining _get_response with a bunch of parameters, and then using a partial like:

callback = partial(_get_response, db=db, dim_type=dim_type, filtered_dataset_ids=filtered_dataset_ids)

Alternatively, you could just define the variables in _get_response as:

nonlocal db
nonlocal dim_type
...

Either way is a fine solution; it's just a bit more boilerplate and makes things a bit more difficult to follow. If you think that approach is easier to read and/or reuse, I'd be happy to switch back to it - I don't really have a strong preference one way or the other.

Contributor replied:

This is very surprising. Python will create a closure for _get_response, so the enclosing function's variables are readable inside it. nonlocal is only needed if you want to support mutation across the inner and outer scope (which isn't the case here).
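
As a small standalone illustration of this point (hypothetical names, not the endpoint code), the inner function can read names bound in the enclosing call without nonlocal or functools.partial:

def endpoint(db, filtered_dataset_ids):
    def _get_response():
        # db and filtered_dataset_ids are captured from the enclosing scope;
        # nonlocal would only be needed to rebind them here, not to read them.
        return {"db": db, "count": len(filtered_dataset_ids)}
    return _get_response()

print(endpoint(db="session", filtered_dataset_ids=["ds-1", "ds-2"]))
# prints: {'db': 'session', 'count': 2}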


filtered_dataset_ids: Optional[list[str]] = None
if data_type is not None or show_only_dimensions_in_datasets:
    # Get subset of datasets matching the filters provided that the user has access to
pgm (Contributor) commented Apr 8, 2025

That's a good point I hadn't thought of: the response is a function of the user's current access as well as the data in the DB. I think your approach here is a good one.
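
To make that concrete (a sketch with a hypothetical hashing helper, not the project's hash_id_list): because the filtered dataset IDs depend on the user's access, two users with different access compute different etags and never share a stale cached response.

import hashlib

def hash_ids(ids: list[str]) -> str:
    # Hypothetical helper: a stable hash over an ordered list of identifiers.
    return hashlib.sha256("|".join(ids).encode()).hexdigest()

# Two users requesting the same dimension type, with different dataset access:
etag_user_a = hash_ids(["gene"] + sorted(["ds-public", "ds-private"]))
etag_user_b = hash_ids(["gene"] + sorted(["ds-public"]))
assert etag_user_a != etag_user_b  # cached responses are never shared across access levels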

snwessel marked this pull request as ready for review April 8, 2025 15:31
snwessel requested a review from jessica-cheng April 8, 2025 15:31
    dataset_ids_without_metadata = None
else:
    # Remove the metadata dataset from our list of datasets
    dataset_ids_without_metadata = [dataset_id for dataset_id in filtered_dataset_ids if dataset_id != dim_type.dataset_id]
Contributor commented:

Could this be included as part of the list comprehension above? (I'm not sure I fully understand what this is for.)
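
For reference, one conditional-expression form of the step quoted above (a sketch with stubbed-in values; the earlier comprehension this comment refers to isn't visible in the excerpt):

from dataclasses import dataclass
from typing import Optional

@dataclass
class DimType:
    dataset_id: str

# Hypothetical inputs mirroring the names in the snippet above.
dim_type = DimType(dataset_id="metadata-ds")
filtered_dataset_ids: Optional[list[str]] = ["ds-1", "metadata-ds", "ds-2"]

# Combine the None check and the metadata-dataset filter into one expression.
dataset_ids_without_metadata = (
    None
    if filtered_dataset_ids is None
    else [d for d in filtered_dataset_ids if d != dim_type.dataset_id]
)
print(dataset_ids_without_metadata)  # ['ds-1', 'ds-2']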


jessica-cheng (Contributor) left a comment

I think the overall logic for etag support seems fine, but there are some assumptions made in your refactoring which may be incorrect. Let me know if you have any questions!

):
"""
For the given dimension type and filters, get all given IDs and labels.
If the `data_type` is given and/or `show_only_dimensions_in_datasets` is True, the dimension identifiers that are returned will only be those that are used within a dataset.
jessica-cheng (Contributor) commented Apr 8, 2025

I noticed this comment is copied from my previous implementation, and I realized it is somewhat misleading. The dimension identifiers that are returned appear to be only those used within a dataset when data_type is given and show_only_dimensions_in_datasets = False, because an implicit filtering usually happens: we usually want to filter by a data_type that is not metadata. However, that is not always the case, such as when data_type=metadata, which might become an issue especially if we start to create datasets whose data type is metadata even though they are not used as a dimension type's metadata.

It seems like you are trying to refactor my original implementation here, and while I do like some of the refactoring, some of the assumptions you made may be incorrect. See my PR for the new test cases I added; they would fail with your implementation but pass with my original implementation.

snwessel (Member, Author) replied:

Thank you! That test was really helpful for clarifying what you mean. I had been assuming that these two requests would always return the same values (for any given dimension type & data type):

  • types/dimensions/{some_val}/identifiers?data_type={data_type}&show_only_dimensions_in_datasets=True
  • types/dimensions/{some_val}/identifiers?data_type={data_type}&show_only_dimensions_in_datasets=False

But I see what you mean - it's a bit more ambiguous in the case where the data type is metadata. I'll update this to match what you described 👍

commit 8db3808
Author: Jessica Cheng <[email protected]>
Date:   Tue Apr 8 15:38:28 2025 -0400

    Add additional test cases to cover metadata data type

commit 3c594c7
Author: Jessica Cheng <[email protected]>
Date:   Tue Apr 8 15:37:05 2025 -0400

    fix query params in test request to better match the test case
snwessel merged commit f4c34d9 into master Apr 10, 2025
4 checks passed
snwessel deleted the elara-etag-setup branch April 10, 2025 16:04