perf(breadbox): Etag setup #233

Conversation
(this is working locally)
```python
        for id, label in dimension_ids_and_labels.items()
    ]
    # If the client already has a cached version of the data, exit early
    if if_none_match and if_none_match[0] == etag:
```
What do you think about something like:

```python
def get_etag_handler(if_none_match: Optional[List[str]] = Header(None)):
    # Used with Depends(): returns a handler that closes over the request's
    # If-None-Match header, so endpoints only supply an etag and a callback.
    def handler(etag, generate_response):
        if if_none_match and if_none_match[0] == etag:
            return get_not_modified_response(etag)
        result = generate_response()
        return get_response_with_etag(content=result, etag=etag)

    return handler


def get_dimension_type_identifiers(
    request: Request,
    name: str,
    data_type: Annotated[Union[str, None], Query()] = None,
    show_only_dimensions_in_datasets: Annotated[bool, Query()] = False,
    limit: Annotated[Union[int, None], Query()] = None,
    db: SessionWithUser = Depends(get_db_with_user),
    etag_handler=Depends(get_etag_handler),
):
    ...
    etag = hash_id_list([name] + (sorted(filtered_dataset_ids) if filtered_dataset_ids else []))

    def _get_response():
        if filtered_dataset_ids is None:
            dataset_ids_without_metadata = None
        else:
            ...
        return result

    return etag_handler(etag, _get_response)
```

I'm trying to lift as much of the etag handling as possible out of the endpoint. The version above gets us down to the least boilerplate and a pattern we could copy for other endpoints.
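For context, here's a minimal, self-contained sketch of the 304/ETag flow that a dependency like this would encapsulate. `get_not_modified_response` and `get_response_with_etag` are breadbox internals, so their behavior here (and the demo endpoint) are assumptions, not the actual implementation:

```python
from typing import Callable, List, Optional

from fastapi import Depends, FastAPI, Header, Response
from fastapi.responses import JSONResponse

app = FastAPI()


def get_etag_handler(if_none_match: Optional[List[str]] = Header(None)) -> Callable:
    # FastAPI resolves the If-None-Match header; the returned handler closes
    # over it so endpoints only need to supply an etag and a callback.
    def handler(etag: str, generate_response: Callable):
        if if_none_match and if_none_match[0].strip('"') == etag:
            # Client already has the current version: skip generating the body.
            return Response(status_code=304, headers={"ETag": f'"{etag}"'})
        return JSONResponse(content=generate_response(), headers={"ETag": f'"{etag}"'})

    return handler


@app.get("/demo")
def demo(etag_handler: Callable = Depends(get_etag_handler)):
    etag = "v1"  # in practice, a hash of whatever determines the response
    return etag_handler(etag, lambda: {"items": [1, 2, 3]})
```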
I actually had an earlier version of this branch working with a very similar setup to what you're suggesting (you can see it here), and I'd be happy to move back to that if you think it's easier to reuse for other endpoints. The reason I'd moved away from that approach was that it involved more boilerplate than what you have above.

I believe the implementation you have above creates some issues with variable scoping when the function `_get_response` is actually called. For example, the internals of `_get_response` wouldn't have access to variables like `db` or `filtered_dataset_ids` unless they're explicitly passed in by defining `_get_response` with a bunch of parameters and then using a partial like:

```python
callback = partial(_get_response, db=db, dim_type=dim_type, filtered_dataset_ids=filtered_dataset_ids)
```

Alternatively, you could just declare the variables in `_get_response` as:

```python
nonlocal db
nonlocal dim_type
...
```

Either way is a fine solution; it's just a bit more boilerplate and makes things a bit more difficult to follow. If you think that approach is easier to read and/or reuse, I'd be happy to switch back to it - I don't really have a strong preference one way or the other.
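To make the partial-based variant concrete, here's a tiny runnable illustration (the values are stand-ins for the endpoint's locals):

```python
from functools import partial

def _get_response(db, dim_type, filtered_dataset_ids):
    # The endpoint's locals arrive as explicit parameters rather than via a closure.
    return {"db": db, "dim_type": dim_type, "ids": filtered_dataset_ids}

callback = partial(_get_response, db="session", dim_type="gene",
                   filtered_dataset_ids=["d1", "d2"])
print(callback())  # {'db': 'session', 'dim_type': 'gene', 'ids': ['d1', 'd2']}
```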
This is very surprising. Python will create a closure for `_get_response`, so it can read those enclosing variables directly. `nonlocal` is only needed if you want to support mutation in both the inner and outer scope (which isn't the case here).
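A quick demonstration of the closure behavior in question: the inner function can read enclosing-scope names without any declaration, and `nonlocal` is only required to rebind them:

```python
def outer():
    db = "session"                   # stand-ins for the endpoint's locals
    filtered_dataset_ids = ["a", "b"]

    def _get_response():
        # Reading enclosing variables works automatically via the closure.
        return f"{db}: {filtered_dataset_ids}"

    def _rebind():
        nonlocal db                  # needed only because we assign to db here
        db = "new session"

    print(_get_response())           # session: ['a', 'b']
    _rebind()
    print(_get_response())           # new session: ['a', 'b']

outer()
```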
```python
    filtered_dataset_ids: Optional[list[str]] = None
    if data_type is not None or show_only_dimensions_in_datasets:
        # Get the subset of datasets matching the provided filters that the user has access to
```
That's a good point, and something I hadn't thought of: the response is a function of the user's current access as well as the data in the DB. I think your approach here is a good one.
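This is visible in how the etag is computed in the snippet earlier in the thread: it hashes the filtered dataset ids, which already reflect the user's access. `hash_id_list` is a breadbox helper, so the body below is only an assumed sketch:

```python
import hashlib
from typing import Optional

def hash_id_list(ids: list[str]) -> str:
    # Assumed implementation: a stable digest over the ordered id list.
    return hashlib.sha256("\n".join(ids).encode("utf-8")).hexdigest()

def compute_etag(name: str, filtered_dataset_ids: Optional[list[str]]) -> str:
    # Because filtered_dataset_ids only contains datasets this user can see,
    # the etag changes whenever the user's access (or the data) changes.
    ids = sorted(filtered_dataset_ids) if filtered_dataset_ids else []
    return hash_id_list([name] + ids)
```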
```python
        dataset_ids_without_metadata = None
    else:
        # Remove the metadata dataset from our list of datasets
        dataset_ids_without_metadata = [
            dataset_id
            for dataset_id in filtered_dataset_ids
            if dataset_id != dim_type.dataset_id
        ]
```
This could be included as part of the list comprehension above, correct? (I'm not sure I fully understand what this is for.)
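If I'm reading the intent right, the suggestion is something like the following, with the exclusion folded into whatever comprehension builds the filtered id list (the names here are guesses, not breadbox code):

```python
from types import SimpleNamespace

# Hypothetical stand-ins for the thread's variables.
dim_type = SimpleNamespace(dataset_id="metadata-ds")
accessible_datasets = [SimpleNamespace(id=i) for i in ["d1", "metadata-ds", "d2"]]

# Folding the metadata-dataset exclusion directly into the comprehension:
dataset_ids_without_metadata = [
    dataset.id for dataset in accessible_datasets
    if dataset.id != dim_type.dataset_id
]
print(dataset_ids_without_metadata)  # ['d1', 'd2']
```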
I think the overall logic for etag support seems fine, but some of the assumptions made in your refactoring may be incorrect. Let me know if you have any questions!
```python
):
    """
    For the given dimension type and filters, get all given IDs and labels.
    If the `data_type` is given and/or `show_only_dimensions_in_datasets` is True,
    the dimension identifiers that are returned will only be those that are used
    within a dataset.
```
I noticed this comment is copied from my previous implementation and realized that it is somewhat misleading. It looks like the returned dimension identifiers will only be those used within a dataset when `data_type` is given and `show_only_dimensions_in_datasets=False`, because implicit filtering usually happens: we usually want to filter by a `data_type` that is not `metadata`. However, that is not always the case, such as when `data_type=metadata`, which might become an issue especially if we start to create datasets whose data type is `metadata` even though they aren't used as a dimension type's metadata.

I noticed that it seems like you're trying to refactor my original implementation here, and while I do like some of the refactoring, some of the assumptions you made may be incorrect. See my PR for the new test cases I added. These would fail with your implementation but pass with my original implementation.
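As a sketch of the rule being described (my paraphrase with assumed names, not breadbox code): restriction to in-dataset identifiers follows either from the explicit flag or from filtering by a concrete non-`metadata` data type:

```python
from typing import Optional

def should_restrict_to_dataset_dimensions(
    data_type: Optional[str], show_only_dimensions_in_datasets: bool
) -> bool:
    if show_only_dimensions_in_datasets:
        return True
    # Filtering by a concrete data_type usually implies restricting to
    # identifiers used in datasets -- but data_type="metadata" is ambiguous,
    # since a metadata dataset may exist without backing a dimension type.
    return data_type is not None and data_type != "metadata"
```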
Thank you! That test was really helpful for clarifying what you mean. I had been assuming that these two requests would always return the same values (for any given dimension type & data type):

```
types/dimensions/{some_val}/identifiers?data_type={data_type}&show_only_dimensions_in_datasets=True
types/dimensions/{some_val}/identifiers?data_type={data_type}&show_only_dimensions_in_datasets=False
```

But I see what you mean - it's a bit more ambiguous in the case where the data type is `metadata`. I'll update this to match what you described 👍
commit 8db3808
Author: Jessica Cheng <[email protected]>
Date: Tue Apr 8 15:38:28 2025 -0400

    Add additional test cases to cover metadata data type

commit 3c594c7
Author: Jessica Cheng <[email protected]>
Date: Tue Apr 8 15:37:05 2025 -0400

    fix query params in test request to better match the test case
You can try this out yourself here: https://dev.cds.team/depmap-test-perf/breadbox/elara/
A full page load with pre-cached identifiers now completes in ~2 seconds, even for a large gene-indexed dataset (here).
In comparison:
So in total, this is MUCH improved!
But it does also add to the complexity of this endpoint.
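For anyone who wants to verify the caching behavior by hand, a small sketch of the round trip (the identifier path and dimension type name below are assumptions, not the exact route):

```python
import httpx

base = "https://dev.cds.team/depmap-test-perf/breadbox"
url = f"{base}/types/dimensions/gene/identifiers"

first = httpx.get(url)
etag = first.headers.get("etag")

# Replaying the request with If-None-Match should yield 304 Not Modified
# (and an empty body) as long as the identifiers are unchanged.
second = httpx.get(url, headers={"If-None-Match": etag})
print(first.status_code, second.status_code)  # expect: 200 304
```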