
Conversation

snwessel (Member) commented Apr 7, 2025

Asana

You can try this out yourself here: https://dev.cds.team/depmap-test-perf/breadbox/elara/

A full page load with pre-cached identifiers now takes ~2 seconds, even for a large gene-indexed dataset (here).

In comparison:

  • When we started making these performance improvements, this same page load in Elara was taking about 6 seconds.
  • In non-Elara DE2 (here), this same page load is still taking about 11 seconds.

So in total, this is MUCH improved!

But it does also add to the complexity of this endpoint.
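
For reference, here is a minimal sketch of the caching flow this change relies on (hypothetical helpers and payload, not the actual Breadbox code): the endpoint computes an ETag from the inputs that determine the response, returns 304 Not Modified when the client's If-None-Match header already matches, and otherwise returns the full payload with the ETag attached.

import hashlib
from typing import Optional
from fastapi import FastAPI, Header, Response
from fastapi.responses import JSONResponse

app = FastAPI()

def compute_etag(name: str) -> str:
    # Hypothetical stand-in: in practice the hash covers everything that
    # determines the response (e.g. dimension type name + visible dataset ids).
    return hashlib.sha256(name.encode()).hexdigest()

def load_identifiers(name: str) -> list[dict]:
    # Hypothetical stand-in for the expensive lookup being cached.
    return [{"id": "example-id", "label": "example-label"}]

@app.get("/types/dimensions/{name}/identifiers")
def get_identifiers(name: str, if_none_match: Optional[str] = Header(None)):
    etag = compute_etag(name)
    if if_none_match == etag:
        # Client already has this version cached; skip the expensive work.
        return Response(status_code=304, headers={"ETag": etag})
    return JSONResponse(content=load_identifiers(name), headers={"ETag": etag})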

for id, label in dimension_ids_and_labels.items()
]
# If the client already has a cached version of the data, exit early
if if_none_match and if_none_match[0] == etag:
Contributor commented:

what do you think about something like:

def get_etag_handler(etag, 
  generate_response, 
  if_none_match: Optional[List[str]] = Header(None)):

  if if_none_match and if_none_match[0] == etag:
    return get_not_modified_response(etag)

  result = generate_response()
  return get_response_with_etag(content=result, etag=etag)
  
def get_dimension_type_identifiers(
    request: Request,
    name: str,
    data_type: Annotated[Union[str, None], Query()] = None,
    show_only_dimensions_in_datasets: Annotated[bool, Query()] = False,
    limit: Annotated[Union[int, None], Query()] = None,
    db: SessionWithUser = Depends(get_db_with_user),
    etag_handler = Depends(get_etag_handler)):
...
   etag = hash_id_list([name] + (sorted(filtered_dataset_ids) if filtered_dataset_ids else []))

   def _get_response():
       if filtered_dataset_ids is None:
           dataset_ids_without_metadata = None
       else:
           ...
       return result

   return etag_handler(etag, _get_response)

I'm trying to lift as much of the etag handling out of the endpoint as possible. The version above gets us to the least amount of boilerplate and a pattern that we could copy for other endpoints.
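
Purely as a reference sketch (one assumption about how such a dependency might be wired so that FastAPI only resolves the header, with the endpoint supplying the etag and a lazy response generator; hypothetical endpoint and names, not the proposal verbatim):

from typing import Callable, List, Optional
from fastapi import Depends, FastAPI, Header, Response
from fastapi.responses import JSONResponse

app = FastAPI()

def get_etag_handler(if_none_match: Optional[List[str]] = Header(None)) -> Callable:
    # The dependency only resolves the If-None-Match header and returns a
    # callable; the endpoint passes in the etag and a function that builds
    # the (expensive) response body only when it is actually needed.
    def handler(etag: str, generate_response: Callable[[], object]):
        if if_none_match and if_none_match[0] == etag:
            return Response(status_code=304, headers={"ETag": etag})
        return JSONResponse(content=generate_response(), headers={"ETag": etag})
    return handler

@app.get("/example")  # hypothetical endpoint, for illustration only
def example(etag_handler: Callable = Depends(get_etag_handler)):
    etag = "abc123"  # in practice, a hash of the inputs that determine the response
    return etag_handler(etag, lambda: {"hello": "world"})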

snwessel (Member, Author) replied:

I actually had an earlier version of this branch working with a very similar setup to what you're suggesting (you can see it here), and I'd be happy to move back to that if you think it's easier to reuse for other endpoints. The reason I moved away from that approach was that it involved more boilerplate than you have above.

I believe the implementation you have above creates some issues with variable scoping when the function _get_response is actually called. For example, the internals of _get_response wouldn't have access to variables like db or filtered_dataset_ids unless they're explicitly passed in by defining _get_response with a bunch of parameters, and then using a partial like:

callback = partial(_get_response, db=db, dim_type=dim_type, filtered_dataset_ids=filtered_dataset_ids)

Alternatively, you could just define the variables in _get_response as:

nonlocal db
nonlocal dim_type
...

Either way is a fine solution; it's just a bit more boilerplate and makes things a bit more difficult to follow. If you think that approach is easier to read and/or reuse, I'd be happy to switch back to it - I don't really have a strong preference one way or the other.

Contributor replied:

This is very surprising. Python will create a closure for _get_response, so the enclosing function's variables are readable inside it. nonlocal is only needed if you want to support mutation across the inner and outer scope (which isn't the case here).
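
As a small standalone illustration of this point (hypothetical names, not the endpoint code), the inner function can read names bound in the enclosing call without nonlocal or functools.partial:

def endpoint(db, filtered_dataset_ids):
    def _get_response():
        # db and filtered_dataset_ids are captured from the enclosing scope;
        # nonlocal would only be needed to rebind them here, not to read them.
        return {"db": db, "count": len(filtered_dataset_ids)}
    return _get_response()

print(endpoint(db="session", filtered_dataset_ids=["ds-1", "ds-2"]))
# prints: {'db': 'session', 'count': 2}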


filtered_dataset_ids: Optional[list[str]] = None
if data_type is not None or show_only_dimensions_in_datasets:
    # Get subset of datasets matching the filters provided that the user has access to
pgm (Contributor) commented Apr 8, 2025

That's a good point I hadn't thought of: the response is a function of the user's current access as well as the data in the DB. I think your approach here is a good one.
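
To make that concrete (a sketch with a hypothetical hashing helper, not the project's hash_id_list): because the filtered dataset IDs depend on the user's access, two users with different access compute different etags and never share a stale cached response.

import hashlib

def hash_ids(ids: list[str]) -> str:
    # Hypothetical helper: a stable hash over an ordered list of identifiers.
    return hashlib.sha256("|".join(ids).encode()).hexdigest()

# Two users requesting the same dimension type, with different dataset access:
etag_user_a = hash_ids(["gene"] + sorted(["ds-public", "ds-private"]))
etag_user_b = hash_ids(["gene"] + sorted(["ds-public"]))
assert etag_user_a != etag_user_b  # cached responses are never shared across access levels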

snwessel marked this pull request as ready for review April 8, 2025 15:31
snwessel requested a review from jessica-cheng April 8, 2025 15:31
    dataset_ids_without_metadata = None
else:
    # Remove the metadata dataset from our list of datasets
    dataset_ids_without_metadata = [dataset_id for dataset_id in filtered_dataset_ids if dataset_id != dim_type.dataset_id]
Contributor commented:

Could this be included as part of the list comprehension above? (I'm not sure I fully understand what this is for.)
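
For reference, one conditional-expression form of the step quoted above (a sketch with stubbed-in values; the earlier comprehension this comment refers to isn't visible in the excerpt):

from dataclasses import dataclass
from typing import Optional

@dataclass
class DimType:
    dataset_id: str

# Hypothetical inputs mirroring the names in the snippet above.
dim_type = DimType(dataset_id="metadata-ds")
filtered_dataset_ids: Optional[list[str]] = ["ds-1", "metadata-ds", "ds-2"]

# Combine the None check and the metadata-dataset filter into one expression.
dataset_ids_without_metadata = (
    None
    if filtered_dataset_ids is None
    else [d for d in filtered_dataset_ids if d != dim_type.dataset_id]
)
print(dataset_ids_without_metadata)  # ['ds-1', 'ds-2']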


jessica-cheng (Contributor) left a comment

I think the overall logic for etag support seems fine, but there are some assumptions made in your refactoring which may be incorrect. Let me know if you have any questions!

):
"""
For the given dimension type and filters, get all given IDs and labels.
If the `data_type` is given and/or `show_only_dimensions_in_datasets` is True, the dimension identifiers that are returned will only be those that are used within a dataset.
jessica-cheng (Contributor) commented Apr 8, 2025

I noticed this comment is copied from my previous implementation, and I realized it is somewhat misleading. The dimension identifiers that are returned appear to be only those used within a dataset when data_type is given and show_only_dimensions_in_datasets = False, because an implicit filtering usually happens: we usually want to filter by a data_type that is not metadata. However, that is not always the case, such as when data_type=metadata, which might become an issue especially if we start to create datasets whose data type is metadata even though they are not used as a dimension type's metadata.

It seems like you are trying to refactor my original implementation here, and while I do like some of the refactoring, some of the assumptions you made may be incorrect. See my PR for the new test cases I added; they would fail with your implementation but pass with my original implementation.

snwessel (Member, Author) replied:

Thank you! That test was really helpful for clarifying what you mean. I had been assuming that these two requests would always return the same values (for any given dimension type & data type):

  • types/dimensions/{some_val}/identifiers?data_type={data_type}&show_only_dimensions_in_datasets=True
  • types/dimensions/{some_val}/identifiers?data_type={data_type}&show_only_dimensions_in_datasets=False

But I see what you mean - it's a bit more ambiguous in the case where the data type is metadata. I'll update this to match what you described 👍

commit 8db3808
Author: Jessica Cheng <[email protected]>
Date:   Tue Apr 8 15:38:28 2025 -0400

    Add additional test cases to cover metadata data type

commit 3c594c7
Author: Jessica Cheng <[email protected]>
Date:   Tue Apr 8 15:37:05 2025 -0400

    fix query params in test request to better match the test case
snwessel merged commit f4c34d9 into master Apr 10, 2025
4 checks passed
snwessel deleted the elara-etag-setup branch April 10, 2025 16:04