@vb-dbrks vb-dbrks commented Nov 27, 2025

Changes

Extended the is_aggr_* check functions from supporting 5 basic aggregates to 20 curated aggregate functions with a hybrid "Curated + Custom" approach.

  1. Curated Aggregate Functions (20 total): added 15 new aggregate functions.
  • Cardinality: count_distinct, approx_count_distinct, count_if
  • Statistical: stddev, stddev_pop, stddev_samp, variance, var_pop, var_samp, median, mode, skewness, kurtosis
  • Percentile: percentile, approx_percentile
  2. Custom Aggregate Support
  • Warning mechanism for non-curated aggregates (UserWarning, still executes)
  • Runtime validation with clear error messages for invalid return types
  • Human-readable violation messages (e.g., "Distinct value count 2..." instead of "Count_distinct 2...")
  3. New aggr_params Parameter
    Added aggr_params: dict[str, Any] to all 4 is_aggr_* functions for aggregates requiring parameters (e.g., percentile, approx_percentile).
  4. count_distinct with group_by Support
    Implemented two-stage aggregation (groupBy + join) for window-incompatible aggregates like count_distinct.
  5. Bug Fixes & Improvements
  • Fixed flaky test test_apply_checks_and_save_in_tables_for_patterns_exclude_no_tables_matching
  • Added performance benchmarks for count_distinct vs approx_count_distinct
  • Updated documentation with usage examples and performance guidance
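
The two-stage aggregation (groupBy + join) described for count_distinct with group_by can be illustrated without Spark. The following is a minimal plain-Python sketch under stated assumptions, not the DQX implementation: the helper name `count_distinct_per_group` and the output key `distinct_count` are hypothetical, and rows are modelled as dicts instead of DataFrame rows.

```python
from collections import defaultdict


def count_distinct_per_group(rows, group_by, column):
    """Attach the per-group distinct count of `column` to every row.

    Hypothetical helper: emulates a groupBy followed by a join on a list of dicts.
    """
    # Stage 1 (groupBy): collect the distinct non-null values seen in each group.
    # Nulls are skipped, mirroring how SQL COUNT(DISTINCT ...) ignores them.
    distinct_values = defaultdict(set)
    for row in rows:
        key = tuple(row[c] for c in group_by)
        if row[column] is not None:
            distinct_values[key].add(row[column])

    # Stage 2 (join): look up each row's group and attach the aggregated metric.
    return [
        {**row, "distinct_count": len(distinct_values[tuple(row[c] for c in group_by)])}
        for row in rows
    ]


rows = [
    {"country": "DE", "user": "a"},
    {"country": "DE", "user": "b"},
    {"country": "DE", "user": "a"},
    {"country": "FR", "user": "c"},
]
result = count_distinct_per_group(rows, ["country"], "user")
print([r["distinct_count"] for r in result])  # [2, 2, 2, 1]
```

Every row keeps its original columns and gains the metric of its group, so a row-level check can then compare `distinct_count` against a limit.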

Linked issues

closes #933 and #929

Resolves #..

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • added end-to-end tests
  • added performance tests

1. is_aggr with group_by and
2. updated demo library and removed
3. bivariate analysis aggr functions
…indowing functions in DQX. Parameter ordering was changed accidentally.

github-actions bot commented Nov 27, 2025

✅ 457/457 passed, 1 flaky, 41 skipped, 2h47m9s total

Flaky tests:

  • 🤪 test_e2e_workflow_serverless (9m48.152s)

Running from acceptance #3304

@mwojtyczka mwojtyczka requested a review from Copilot November 27, 2025 17:05
Copilot AI left a comment

Pull request overview

This PR extends the aggregate check functions (is_aggr_*) from supporting 5 basic aggregates to 20 curated functions using a hybrid "Curated + Custom" approach. The implementation adds support for statistical functions (stddev, variance, median, mode), percentile functions (percentile, approx_percentile), and cardinality functions (count_distinct, approx_count_distinct), while also enabling custom aggregates with runtime validation and clear error messages.

Key Changes:

  • Added 15 new curated aggregate functions (total: 20) organized by category (statistical, cardinality, percentiles)
  • Implemented custom aggregate support with UserWarning mechanism and runtime validation
  • Added aggr_params parameter to all 4 is_aggr_* functions for parameterized aggregates
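
As a rough illustration of how the three changes above could fit together, here is a minimal plain-Python sketch. It is not the code from check_funcs.py: the registry `CURATED`, the function `evaluate_aggregate`, the `custom` argument, and the nearest-rank `_percentile` helper are all hypothetical, and the built-in ValueError and TypeError stand in for whatever exception types DQX actually raises.

```python
import math
import statistics
import warnings
from typing import Any, Callable, Sequence


def _percentile(values: Sequence[float], percentage: float) -> float:
    # Nearest-rank percentile, chosen for brevity; Spark's definition may differ.
    ordered = sorted(values)
    rank = max(1, math.ceil(percentage * len(ordered)))
    return ordered[rank - 1]


# Hypothetical curated registry (the PR curates 20 Spark aggregates).
CURATED: dict[str, Callable[..., Any]] = {
    "count": len,
    "median": statistics.median,
    "stddev": statistics.stdev,
    "percentile": _percentile,
}


def evaluate_aggregate(aggr_type, values, aggr_params=None, custom=None):
    """Resolve aggr_type, forward aggr_params, and validate the result type."""
    func = CURATED.get(aggr_type)
    if func is None:
        func = (custom or {}).get(aggr_type)
        if func is None:
            raise ValueError(f"Unknown aggregate function: {aggr_type!r}")
        # Non-curated aggregates still execute, but the caller is warned.
        warnings.warn(f"{aggr_type!r} is not a curated aggregate", UserWarning)
    result = func(values, **(aggr_params or {}))
    # Runtime validation: the result must be a single number so that it can
    # be compared against a limit.
    if isinstance(result, bool) or not isinstance(result, (int, float)):
        raise TypeError(
            f"{aggr_type!r} returned {type(result).__name__}, expected a number"
        )
    return result


# Curated aggregate that needs a parameter.
p50 = evaluate_aggregate("percentile", list(range(1, 11)), {"percentage": 0.5})

# Custom aggregate: emits a UserWarning, then runs normally.
spread = evaluate_aggregate(
    "value_range", [3, 9, 4], custom={"value_range": lambda v: max(v) - min(v)}
)
```

Keeping the warning separate from the type check means an unfamiliar aggregate is allowed to run, while one that returns a non-numeric value (for example a list) fails with an explicit message.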

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
| File | Description |
| --- | --- |
| src/databricks/labs/dqx/check_funcs.py | Core implementation: added CURATED_AGGR_FUNCTIONS set, aggr_params parameter, _build_aggregate_expression and _validate_aggregate_return_type helper functions, enhanced validation logic |
| tests/unit/test_row_checks.py | Updated test to verify warning behavior for invalid aggr_type instead of immediate error |
| tests/integration/test_dataset_checks.py | Added comprehensive integration tests for new aggregate functions including count_distinct, statistical functions, percentiles, and custom aggregates |
| docs/dqx/docs/reference/quality_checks.mdx | Added documentation for aggregate function types, categorization, and usage examples |
| docs/dqx/docs/guide/quality_checks_definition.mdx | Added practical use case examples for extended aggregates |
| demos/dqx_demo_library.py | Added 5 demo examples showcasing new aggregate functions in real-world scenarios |

Comments suppressed due to low confidence (2)

tests/integration/test_dataset_checks.py:1

  • The documentation example contradicts the implementation. According to the code in check_funcs.py (lines 2385-2391), count_distinct cannot be used with group_by due to a Spark limitation. This example will fail at runtime with an InvalidParameterError. Either remove the group_by parameter or change aggr_type to approx_count_distinct.
from collections.abc import Callable

docs/dqx/docs/guide/quality_checks_definition.mdx:1

  • The admonition correctly documents the count_distinct limitation, but this contradicts the example at lines 195-198 which shows count_distinct being used with group_by. The example should be updated to match this documented limitation.
---

…entation. Docs updated with more user-friendly language.

codecov bot commented Nov 27, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.11%. Comparing base (d200468) to head (24748ae).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #951      +/-   ##
==========================================
+ Coverage   90.07%   90.11%   +0.04%     
==========================================
  Files          64       64              
  Lines        6138     6174      +36     
==========================================
+ Hits         5529     5564      +35     
- Misses        609      610       +1     

@vb-dbrks vb-dbrks force-pushed the 929-feature-extend-is_aggr-check_funcs branch from 9416b75 to 20870c8 Compare December 2, 2025 16:36
1. removed dead code which was "just in case"
2. added test for incorrect parameters
3. More permissive parameter passing to aggr functions

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

  • [FEATURE]: Add count distinct
  • [FEATURE]: Extend is_aggr check_funcs to support more aggregate functions

3 participants