-
Notifications
You must be signed in to change notification settings - Fork 606
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize Query Performance counts with count scans #5538
Conversation
Important Review skippedReview was skipped due to path filters ⛔ Files ignored due to path filters (1)
CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters. For example, including You can disable this status message by setting the WalkthroughThis pull request updates aggregation-related code across multiple modules. It introduces a new query performance flag into the state management, GraphQL schema, backend aggregation logic, and associated tests. The changes add new fields and parameters (such as Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant Server
participant Resolver
participant Aggregator
Client->>Server: Send aggregation request with queryPerformance flag
Server->>Resolver: Process AggregationForm
alt queryPerformance is False
Resolver->>Aggregator: Append standard aggregations
else queryPerformance is True
Resolver->>Aggregator: Skip performance-intensive aggregations
end
Aggregator-->>Resolver: Return aggregation results
Resolver-->>Server: Forward response
Server-->>Client: Return final response
sequenceDiagram
participant AggregationQuery
participant CountInstance
AggregationQuery->>CountInstance: Call to_mongo(sample_collection)
alt _optimize is False
CountInstance->>CountInstance: Append match stage to pipeline
else _optimize is True
CountInstance->>CountInstance: Skip match stage for optimization
end
CountInstance-->>AggregationQuery: Return aggregation data
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
fiftyone/core/collections.py (1)
10229-10275
: New async aggregation method for optimized query performance.A new asynchronous aggregation method has been added that appears to be part of the query performance optimization. The method only processes facet-type aggregations, which aligns with the PR objective of optimizing sidebar counts in Query Performance mode.
The
debug
parameter is accepted but not used within the method body. This suggests it's intended for future use or debugging. Consider adding a docstring comment to clarify the purpose of this parameter or leverage it within the method implementation if it's already intended for use.- async def _async_aggregate(self, aggregations, debug=False): + async def _async_aggregate(self, aggregations, debug=False): + """Asynchronously aggregates data for query performance optimization. + + Args: + aggregations: Aggregation instances to process + debug (False): Whether to include additional debugging information + + Returns: + The aggregation results + """fiftyone/core/aggregations.py (1)
588-593
: Implemented the key optimization logic.This change implements the core of the optimization described in the PR objectives:
- When
_optimize
is true, it skips adding a match stage that filters for non-null values- This significantly reduces processing time for large datasets
The code matches the logic with the business requirement of optimizing sidebar counts in query performance mode.
However, there's a minor code style improvement opportunity:
-if not self._optimize: - if not sample_collection._contains_videos() or path != "frames": - pipeline.append( - {"$match": {"$expr": {"$gt": ["$" + path, None]}}} - ) +if not self._optimize and (not sample_collection._contains_videos() or path != "frames"): + pipeline.append( + {"$match": {"$expr": {"$gt": ["$" + path, None]}}} + )This small refactor combines the nested conditionals for better readability.
🧰 Tools
🪛 Ruff (0.8.2)
588-589: Use a single
if
statement instead of nestedif
statements(SIM102)
fiftyone/server/aggregations.py (1)
234-254
: Effective conditional aggregation logic for performance optimization.This is the core optimization - when query_performance is True, only the basic Count aggregation is performed, skipping additional aggregations like CountValues and Bounds. This should significantly reduce the computational load during queries, which aligns with the PR objective of enhancing query performance.
Consider adding a brief comment explaining the performance implications of this change for future maintainers.
- if not query_performance: + # Skip additional aggregations when query_performance is enabled to optimize query execution + if not query_performance:
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (4)
app/packages/app/src/pages/datasets/__generated__/DatasetPageQuery.graphql.ts
is excluded by!**/__generated__/**
,!**/__generated__/**
app/packages/relay/src/fragments/__generated__/estimatedCountsFragment.graphql.ts
is excluded by!**/__generated__/**
,!**/__generated__/**
app/packages/relay/src/queries/__generated__/aggregationsQuery.graphql.ts
is excluded by!**/__generated__/**
,!**/__generated__/**
app/packages/relay/src/queries/__generated__/datasetQuery.graphql.ts
is excluded by!**/__generated__/**
,!**/__generated__/**
📒 Files selected for processing (6)
app/packages/state/src/recoil/aggregations.ts
(2 hunks)app/schema.graphql
(4 hunks)fiftyone/core/aggregations.py
(6 hunks)fiftyone/core/collections.py
(1 hunks)fiftyone/server/aggregations.py
(5 hunks)tests/unittests/server_aggregations_tests.py
(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{ts,tsx}`: Review the Typescript and React code for co...
**/*.{ts,tsx}
: Review the Typescript and React code for conformity with best practices in React, Recoil, Graphql, and Typescript. Highlight any deviations.
app/packages/state/src/recoil/aggregations.ts
🪛 Ruff (0.8.2)
fiftyone/core/aggregations.py
588-589: Use a single if
statement instead of nested if
statements
(SIM102)
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
- GitHub Check: test / test-app
- GitHub Check: lint / eslint
- GitHub Check: build / build
- GitHub Check: e2e / test-e2e
- GitHub Check: build
🔇 Additional comments (21)
app/packages/state/src/recoil/aggregations.ts (2)
16-16
: Added import for the new query performance feature.The addition of this import brings in the queryPerformance state atom that will be used to optimize aggregation operations.
89-89
: Added queryPerformance flag to enable performance optimization.The newly added property forwards the query performance flag from the client state to the GraphQL server. This enables the optimization described in the PR objectives that significantly improves database response time for sidebar counts.
tests/unittests/server_aggregations_tests.py (4)
97-118
: Code refactoring improves test readability.Restructuring the
form
dictionary into a standalone variable improves code readability and maintains better separation of test input creation from execution.
119-120
: No change in test behavior during initial execution.This execute call still retains the same functionality as the original code, just with improved structure.
136-138
: Added key test case for query performance optimization.This new test case validates that when
query_performance
is enabled, the aggregation engine properly skips unnecessary processing by returningNone
for the values field.
139-152
: Test assertions validate the optimization's correctness.The assertions verify that with query performance enabled:
- The essential count information is still returned correctly (count: 1, exists: 1)
- The values array is set to None, confirming that the optimization is working as designed
This matches the PR objective of "removing all aggregations except for the Count" to improve query performance.
app/schema.graphql (4)
35-35
: Added queryPerformance flag to GraphQL schema.The new parameter
queryPerformance
in the AggregationForm type allows clients to explicitly request performance optimization for aggregation operations. Setting it to defaultfalse
maintains backward compatibility with existing clients.
242-242
: Updated estimatedSampleCount to be nullable.Making this field nullable (by changing from
Int!
toInt
) increases flexibility in cases where the count might not be available or calculable, especially when performance optimization is enabled.
617-621
: Improved dataset query method signature formatting and nullability.The changes here:
- Improve readability by placing parameters on separate lines
- Update the return type to
Dataset!
making the return type non-nullable when the dataset existsThis change ensures better type safety in the GraphQL schema.
803-803
: Made values field nullable to support query performance optimization.Changing
values
from[StringAggregationValue!]!
to[StringAggregationValue!]
allows the field to be null when query performance is enabled, which is consistent with the optimization strategy described in the PR objectives.fiftyone/core/aggregations.py (5)
540-546
: Added optimization parameter to Count constructor.The addition of the
_optimize
parameter gives the Count aggregation the ability to skip unnecessary operations for performance improvement.
549-550
: Stored optimization flag as an instance attribute.Properly storing the flag enables its use in the
to_mongo
method later.
585-586
: Passed optimization flag to field parser.This ensures the optimization flag is properly forwarded to the parsing logic.
3006-3006
: Added optimize parameter to parse function.The addition of the
optimize
parameter to_parse_field_and_expr
allows this core function to properly support the query performance optimization feature.
3113-3114
: Extended optimization to avoid unnecessary projections.This change further enhances query performance by conditionally skipping projection stages when the optimize flag is enabled. The optimization is properly scoped to non-context cases to avoid affecting other use cases.
fiftyone/server/aggregations.py (6)
45-45
: Good addition of the query_performance flag.Adding this flag allows controlling performance optimization mode while maintaining backward compatibility with the default value of
False
.
96-96
: Appropriate update to make values optional.Making
values
optional is consistent with the new behavior whereCountValues
aggregation may be skipped when query performance optimization is enabled.
144-149
: Correctly passing query_performance flag to aggregation resolver.The change correctly forwards the query_performance flag to the _resolve_path_aggregation function for each path.
216-217
: Function signature appropriately updated.The function signature now includes the query_performance parameter which is used to control aggregation behavior.
219-221
:✅ Verification successful
Optimization for Count aggregation.
Good use of the _optimize parameter to optimize the Count aggregation when query_performance is enabled.
🏁 Script executed:
#!/bin/bash # Verify if the Count class in the fiftyone.core.aggregations module supports the _optimize parameter grep -r "_optimize" --include="*.py" fiftyone/core/aggregations.pyLength of output: 194
Count Aggregation _optimize Parameter Confirmation — Approved
The verification confirms that the core aggregation code correctly supports the _optimize parameter (defaulting to False and used to control the optimization behavior). The use of the _optimize parameter in the Count aggregation is implemented as intended.
257-309
:✅ Verification successful
Review the result processing logic for missing aggregations.
The from_results function needs to handle cases where certain aggregations weren't performed due to query_performance being True. While the current implementation seems to handle it correctly because it only processes aggregations that were actually performed, verify that no assumptions are made about the presence of specific aggregation results.
For example, when query_performance is True, data["values"] won't be set for StringAggregation since CountValues isn't performed. Ensure that all client code that consumes these results can handle missing fields.
🏁 Script executed:
#!/bin/bash # Look for places where StringAggregation is used and check if they handle the case where values is None grep -r "StringAggregation" --include="*.py" --include="*.js" --include="*.ts" fiftyone/ grep -r "values.*StringAggregation" --include="*.py" --include="*.js" --include="*.ts" fiftyone/Length of output: 712
Final Verification: Missing Aggregations Are Handled as Expected
After reviewing the processing logic in
fiftyone/server/aggregations.py
and verifying via grep that theStringAggregation
results are defined with an optionalvalues
field (defaulting toNone
), it appears that the code correctly processes only the aggregations that were executed. In the case wherequery_performance
isTrue
, the absence of avalues
key (for instance, in aStringAggregation
) is intentional, and the Optional type annotation ensures that client code should be prepared to handle a missing field.
- The optional definition (
values: t.Optional[t.List[StringAggregationValue]] = None
) confirms that no value is expected when count values aren’t performed.- Clients consuming these aggregation results must verify the presence of a field before use, which is consistent with how the aggregation results are processed.
No changes are required here as long as consumers properly check for the
None
value.
1bcf43b
to
8e580a9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
fiftyone/core/aggregations.py (1)
588-593
: Core optimization logic effectively reduces computational load.This conditional is the key performance optimization in this PR. By skipping the match stage when
_optimize
is True, the aggregation pipeline becomes more efficient, which explains the 4x performance improvement mentioned in the PR objectives.Consider simplifying the nested if statements as suggested by static analysis:
- if not self._optimize: - if not sample_collection._contains_videos() or path != "frames": - pipeline.append( - {"$match": {"$expr": {"$gt": ["$" + path, None]}}} - ) + if not self._optimize and (not sample_collection._contains_videos() or path != "frames"): + pipeline.append( + {"$match": {"$expr": {"$gt": ["$" + path, None]}}} + )🧰 Tools
🪛 Ruff (0.8.2)
588-589: Use a single
if
statement instead of nestedif
statements(SIM102)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (4)
app/packages/app/src/pages/datasets/__generated__/DatasetPageQuery.graphql.ts
is excluded by!**/__generated__/**
,!**/__generated__/**
app/packages/relay/src/fragments/__generated__/estimatedCountsFragment.graphql.ts
is excluded by!**/__generated__/**
,!**/__generated__/**
app/packages/relay/src/queries/__generated__/aggregationsQuery.graphql.ts
is excluded by!**/__generated__/**
,!**/__generated__/**
app/packages/relay/src/queries/__generated__/datasetQuery.graphql.ts
is excluded by!**/__generated__/**
,!**/__generated__/**
📒 Files selected for processing (5)
app/packages/state/src/recoil/aggregations.ts
(2 hunks)app/schema.graphql
(4 hunks)fiftyone/core/aggregations.py
(6 hunks)fiftyone/server/aggregations.py
(5 hunks)tests/unittests/server_aggregations_tests.py
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- app/packages/state/src/recoil/aggregations.ts
- tests/unittests/server_aggregations_tests.py
- app/schema.graphql
- fiftyone/server/aggregations.py
🧰 Additional context used
🪛 Ruff (0.8.2)
fiftyone/core/aggregations.py
588-589: Use a single if
statement instead of nested if
statements
(SIM102)
🔇 Additional comments (5)
fiftyone/core/aggregations.py (5)
540-547
: Addition of optimization parameter enhances query performance.The addition of
_optimize
parameter to theCount
class constructor is a good implementation for the query performance enhancement. It provides a way to control whether to optimize the count aggregation pipeline for performance.
549-550
: LGTM! Parameter storage is consistent with class style.The parameter is properly stored as an instance attribute, consistent with the class's coding style and pattern.
585-586
: LGTM! Parameter correctly passed to parsing function.The
optimize
parameter is correctly passed to the_parse_field_and_expr
function.
3006-3007
: LGTM! Added parameter to support optimization logic.The
optimize
parameter with a default value ofFalse
is correctly added to the_parse_field_and_expr
function.
3113-3114
: LGTM! Condition modified to account for optimization.The condition now checks for both context and optimize flags before appending to the pipeline, which is consistent with the optimization goal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this removes label counts when qp is enabled? Not sure if this should go through product first
@@ -42,6 +42,7 @@ class AggregationForm: | |||
slices: t.Optional[t.List[str]] | |||
view: BSONArray | |||
view_name: t.Optional[str] = None | |||
query_performance: t.Optional[bool] = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where does this value come from?/how is it set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It comes from the App setting @fiftyone/state/queryPerformance
atom
@@ -88,7 +88,7 @@ export type datasetQuery$data = { | |||
readonly slug: string | null; | |||
} | null; | |||
readonly " $fragmentSpreads": FragmentRefs<"datasetFragment">; | |||
} | null; | |||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this safe? Could no dataset due to no permissions cause problems?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be changing. I need to figure out why generated output is different. Thanks!
field_or_expr=None, | ||
expr=None, | ||
safe=False, | ||
_optimize=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why don't we set it to true by default? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add a note. I'd like to do this in the future, but for now I am confident that the server only uses Count
in a way that makes this optimization valid. I am unsure at the moment if this is true for all Count
usage in the SDK.
@@ -238,7 +239,7 @@ type Dataset { | |||
appConfig: DatasetAppConfig | |||
info: JSON | |||
estimatedFrameCount: Int | |||
estimatedSampleCount: Int! | |||
estimatedSampleCount: Int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unclear to me why the required !
was removed. I will follow up
This removes nothing from the UI. It only omits queries that are not needed by QP sidebar UI, but are required by the older non-QP sidebar mode. So when QP is enabled, we omit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change looks good to me, thanks Ben 🚢 small comments otherwise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
fiftyone/core/aggregations.py (1)
553-554
: Consider adding _optimize to _kwargs methodFor consistency with other parameters, consider adding the
_optimize
parameter to the_kwargs
method. This would ensure that serialization and deserialization ofCount
aggregations include the optimization settings.def _kwargs(self): - return super()._kwargs() + [["_unwind", self._unwind]] + return super()._kwargs() + [["_unwind", self._unwind], ["_optimize", self._optimize]]
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
app/schema.graphql
(3 hunks)fiftyone/core/aggregations.py
(6 hunks)fiftyone/server/aggregations.py
(7 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- app/schema.graphql
- fiftyone/server/aggregations.py
🧰 Additional context used
🪛 Ruff (0.8.2)
fiftyone/core/aggregations.py
589-590: Use a single if
statement instead of nested if
statements
(SIM102)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.11)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.9)
- GitHub Check: test / test-app
- GitHub Check: build / build
- GitHub Check: lint / eslint
- GitHub Check: e2e / test-e2e
- GitHub Check: build
🔇 Additional comments (6)
fiftyone/core/aggregations.py (6)
539-550
: Introduces optimization flag for Count aggregationThe addition of the
_optimize
parameter with a default value ofFalse
allows for conditionally skipping certain pipeline stages when counting samples, which should improve performance without changing the API's backward compatibility.
585-586
: Propagates optimization flag to parsing functionThe
_optimize
parameter is correctly passed to the_parse_field_and_expr
function, ensuring consistent behavior throughout the pipeline construction.
3007-3008
: Adds optimize parameter to parsing functionThe function signature is updated to accept the new
optimize
parameter with default valueFalse
, maintaining backward compatibility.
3114-3115
: Conditionally skip project stage when optimize is TrueThis optimization skips the pipeline stage that projects the path when
optimize
isTrue
and there's no context, further improving performance.
545-545
: Consider adding documentation for the _optimize parameterThe
_optimize
parameter is an important optimization feature, but it's not documented in the class docstring. Consider adding documentation to explain its purpose, usage, and impact.Note: I see from previous review comments that you're considering making this
True
by default once validated for all SDK usage cases.
588-594
:✅ Verification successful
Conditionally skip None-value filtering when _optimize is True
This change allows the pipeline to skip the match stage that filters out None values when the
_optimize
flag is set, which should reduce computational overhead. The TODO comment suggests the team is considering if this optimization could be applied more broadly.
🏁 Script executed:
#!/bin/bash # Search for other usages of Count aggregation in the codebase to ensure this optimization is safe rg -A 2 "fo\.Count\(" --type pyLength of output: 1824
Review Verified: None-value Filtering Optimization Is Safe
The search results for
fo.Count(
indicate that this optimization only affects the filtering ofNone
values when_optimize
is set. There are no conflicting usages in either the core aggregations or test cases, and the tests confirm that the Count aggregation behaves as expected.
- The optimization correctly skips the
$match
filtering stage when_optimize
is True.- No evidence suggests that this change adversely affects any
Count
aggregation usage.- The TODO note about broader application of this optimization remains valid for future exploration.
🧰 Tools
🪛 Ruff (0.8.2)
589-590: Use a single
if
statement instead of nestedif
statements(SIM102)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
What changes are proposed in this pull request?
With Query Performance enabled, excessive aggregation were still being run from the original sidebar mode.
Removing all aggregations except
Count
when query performance is enabled, and ensuring count scans occur when possible, yields at least 4x faster db responses for sample level scalars. Tested with BDD100k (69,863 samples). 172ms db time to 39ms db time.How is this patch tested? If it is not, please explain why.
Server aggregation assertion
Release Notes
Query Performance <app-optimizing-query-performance>
modeWhat areas of FiftyOne does this PR affect?
fiftyone
Python library changesSummary by CodeRabbit
New Features
queryPerformance
flag to aggregation forms to influence performance behavior.Schema Updates
Tests