-
Notifications
You must be signed in to change notification settings - Fork 74
929 feature extend is aggr check funcs #951
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…indowing functions in DQX. Parameter ordering was changed accidentaly.
|
✅ 457/457 passed, 1 flaky, 41 skipped, 2h47m9s total Flaky tests:
Running from acceptance #3304 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR extends the aggregate check functions (is_aggr_*) from supporting 5 basic aggregates to 20 curated functions using a hybrid "Curated + Custom" approach. The implementation adds support for statistical functions (stddev, variance, median, mode), percentile functions (percentile, approx_percentile), and cardinality functions (count_distinct, approx_count_distinct), while also enabling custom aggregates with runtime validation and clear error messages.
Key Changes:
- Added 15 new curated aggregate functions (total: 20) organized by category (statistical, cardinality, percentiles)
- Implemented custom aggregate support with UserWarning mechanism and runtime validation
- Added
aggr_paramsparameter to all 4is_aggr_*functions for parameterized aggregates
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/databricks/labs/dqx/check_funcs.py | Core implementation: added CURATED_AGGR_FUNCTIONS set, aggr_params parameter, _build_aggregate_expression and _validate_aggregate_return_type helper functions, enhanced validation logic |
| tests/unit/test_row_checks.py | Updated test to verify warning behavior for invalid aggr_type instead of immediate error |
| tests/integration/test_dataset_checks.py | Added comprehensive integration tests for new aggregate functions including count_distinct, statistical functions, percentiles, and custom aggregates |
| docs/dqx/docs/reference/quality_checks.mdx | Added documentation for aggregate function types, categorization, and usage examples |
| docs/dqx/docs/guide/quality_checks_definition.mdx | Added practical use case examples for extended aggregates |
| demos/dqx_demo_library.py | Added 5 demo examples showcasing new aggregate functions in real-world scenarios |
Comments suppressed due to low confidence (2)
tests/integration/test_dataset_checks.py:1
- The documentation example contradicts the implementation. According to the code in check_funcs.py (lines 2385-2391),
count_distinctcannot be used withgroup_bydue to a Spark limitation. This example will fail at runtime with anInvalidParameterError. Either remove thegroup_byparameter or changeaggr_typetoapprox_count_distinct.
from collections.abc import Callable
docs/dqx/docs/guide/quality_checks_definition.mdx:1
- The admonition correctly documents the
count_distinctlimitation, but this contradicts the example at lines 195-198 which showscount_distinctbeing used withgroup_by. The example should be updated to match this documented limitation.
---
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.
…entation. docs updated with more user friendly language.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #951 +/- ##
==========================================
+ Coverage 90.07% 90.11% +0.04%
==========================================
Files 64 64
Lines 6138 6174 +36
==========================================
+ Hits 5529 5564 +35
- Misses 609 610 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
9416b75 to
20870c8
Compare
…nd quality_checks.mdx
1. removed dead code which was "just in case" 2. added test for incorrect parameters 3. More permissive parameter passing to aggr functions
…lude_no_tables_matching
Changes
Extended the is_aggr_* check functions from supporting 5 basic aggregates to 20 curated aggregate functions with a hybrid "Curated + Custom" approach.
Added aggr_params: dict[str, Any] to all 4 is_aggr_* functions for aggregates requiring parameters (e.g., percentile, approx_percentile).
Implemented two-stage aggregation (groupBy + join) for window-incompatible aggregates like count_distinct.
Linked issues
closes #933 and #929
Resolves #..
Tests