feat: propagate property numerical statistics #204

Merged
AndreaBozzo merged 2 commits into master from ab/prop-num-stats on Feb 2, 2026

@AndreaBozzo AndreaBozzo commented Feb 2, 2026

Propagates numerical statistics properties.
Remaining work has been moved to issue #203.

@AndreaBozzo AndreaBozzo changed the title from "Add property numerical statistics" to "feat: propagate property numerical statistics" Feb 2, 2026
@AndreaBozzo AndreaBozzo requested a review from Copilot February 2, 2026 14:41
@AndreaBozzo AndreaBozzo self-assigned this Feb 2, 2026
Copilot AI left a comment

Pull request overview

This PR propagates advanced numerical statistics (median, quartiles, skewness, kurtosis, coefficient of variation, mode, variance) from the core calculate_numeric_stats function to all profiling engines (Arrow columnar, streaming incremental, and streaming mapped). The changes also expose these statistics in Python bindings and update HTML/JSON output rendering.
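For readers unfamiliar with the statistics listed above, here is a minimal sketch of how most of them can be derived from a slice of values. It is not dataprof's calculate_numeric_stats: the struct `NumericSummary`, the functions `summarize` and `quantile`, and the specific conventions (population moments, linear-interpolation quantiles, excess kurtosis) are all assumptions made for illustration, and mode is left out for brevity.

```rust
/// Hypothetical summary struct; field names are illustrative, not dataprof's.
#[derive(Debug, Clone, PartialEq)]
pub struct NumericSummary {
    pub mean: f64,
    pub median: f64,
    pub q1: f64,
    pub q3: f64,
    pub variance: f64,
    pub skewness: f64,
    pub kurtosis: f64,
    /// None when the mean is zero, since the ratio is undefined there.
    pub coefficient_of_variation: Option<f64>,
}

/// Linear-interpolation quantile on an already sorted, non-empty slice.
fn quantile(sorted: &[f64], p: f64) -> f64 {
    let pos = p * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    let frac = pos - lo as f64;
    sorted[lo] + (sorted[hi] - sorted[lo]) * frac
}

/// Computes the summary using population (divide-by-n) central moments.
/// Returns None for empty input.
pub fn summarize(values: &[f64]) -> Option<NumericSummary> {
    if values.is_empty() {
        return None;
    }
    let n = values.len() as f64;

    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));

    let mean = values.iter().sum::<f64>() / n;
    // Second, third and fourth central moments.
    let m2 = values.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
    let m3 = values.iter().map(|v| (v - mean).powi(3)).sum::<f64>() / n;
    let m4 = values.iter().map(|v| (v - mean).powi(4)).sum::<f64>() / n;

    // Skewness and excess kurtosis are undefined for constant data;
    // this sketch reports 0.0 in that case.
    let (skewness, kurtosis) = if m2 > 0.0 {
        (m3 / m2.powf(1.5), m4 / (m2 * m2) - 3.0)
    } else {
        (0.0, 0.0)
    };

    let coefficient_of_variation = if mean != 0.0 {
        Some(m2.sqrt() / mean.abs())
    } else {
        None
    };

    Some(NumericSummary {
        mean,
        median: quantile(&sorted, 0.5),
        q1: quantile(&sorted, 0.25),
        q3: quantile(&sorted, 0.75),
        variance: m2,
        skewness,
        kurtosis,
        coefficient_of_variation,
    })
}
```

Calling `summarize(&[1.0, 2.0, 3.0, 4.0, 5.0])` under these conventions gives a median of 3, quartiles of 2 and 4, a variance of 2 and a skewness of 0. A real implementation may use sample (n-1) variance or a different quantile rule, so results can differ slightly.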

Changes:

  • Refactored streaming engines (incremental and mapped) to delegate numeric statistics calculation to the centralized calculate_numeric_stats function instead of computing them inline
  • Updated Arrow columnar engines to use calculate_numeric_stats and increased sample size cap from 100 to 10,000 to match the threshold used by the stats module
  • Extended Python bindings (PyColumnProfile) to expose all new numeric statistics fields
  • Enhanced HTML templates and Python HTML generation to display advanced numeric statistics
  • Added comprehensive test coverage for the new statistics in streaming and columnar engines
  • Increased max_sample_size in StreamingStatistics from 1,000 to 10,000 to align with the numeric stats module
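The delegation pattern described in the list above can be sketched roughly as follows. Everything here is hypothetical: `shared_numeric_stats` stands in for a centralized statistics function, `StreamingColumn` and `batch_stats` stand in for two engines, and `MAX_SAMPLE_SIZE` only mirrors the 10,000 figure from the PR. Keeping the first N values is a simplification; the actual engines may sample differently.

```rust
/// Cap mirroring the 10,000 figure from the PR; the constant name is hypothetical.
pub const MAX_SAMPLE_SIZE: usize = 10_000;

#[derive(Debug, PartialEq)]
pub struct BasicStats {
    pub count: usize,
    pub mean: f64,
    pub median: f64,
}

/// Stand-in for a centralized statistics function that every engine calls.
pub fn shared_numeric_stats(values: &[f64]) -> Option<BasicStats> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    let median = if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    };
    let mean = values.iter().sum::<f64>() / n as f64;
    Some(BasicStats { count: n, mean, median })
}

/// Hypothetical streaming accumulator: retains at most `max_sample_size`
/// values and delegates the statistics instead of computing them inline.
pub struct StreamingColumn {
    samples: Vec<f64>,
    seen: usize,
    max_sample_size: usize,
}

impl StreamingColumn {
    pub fn new(max_sample_size: usize) -> Self {
        Self { samples: Vec::new(), seen: 0, max_sample_size }
    }

    /// Counts every value, but stores it only while under the cap.
    pub fn push(&mut self, value: f64) {
        self.seen += 1;
        if self.samples.len() < self.max_sample_size {
            self.samples.push(value);
        }
    }

    pub fn seen(&self) -> usize {
        self.seen
    }

    pub fn sample_len(&self) -> usize {
        self.samples.len()
    }

    pub fn stats(&self) -> Option<BasicStats> {
        shared_numeric_stats(&self.samples)
    }
}

/// Hypothetical batch engine: applies the same cap, then calls the same function.
pub fn batch_stats(values: &[f64]) -> Option<BasicStats> {
    let capped = &values[..values.len().min(MAX_SAMPLE_SIZE)];
    shared_numeric_stats(capped)
}
```

Because both paths use the same cap and the same function, they report identical statistics for the same input, which is the practical benefit of aligning the sample thresholds across engines.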

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/python/types.rs | Added numeric statistics fields to PyColumnProfile and updated JSON serialization |
| src/output/html.rs | Updated format_column_stats_json to include all new numeric statistics fields |
| templates/single_report.hbs | Added HTML rendering for advanced numeric statistics (median, skewness, kurtosis, CV) |
| src/engines/streaming/mapped.rs | Refactored to use calculate_numeric_stats instead of inline calculations |
| src/engines/streaming/incremental.rs | Refactored to use calculate_numeric_stats and added test coverage |
| src/engines/columnar/record_batch_analyzer.rs | Updated to use calculate_numeric_stats and increased sample cap to 10,000 |
| src/engines/columnar/arrow_profiler.rs | Updated to use calculate_numeric_stats and increased sample cap to 10,000 |
| src/core/streaming_stats.rs | Increased max_sample_size from 1,000 to 10,000 |
| python/dataprof/__init__.py | Added numeric statistics rendering in both JSON and HTML export methods |

@AndreaBozzo AndreaBozzo merged commit 9761c2f into master Feb 2, 2026
15 checks passed
@AndreaBozzo AndreaBozzo deleted the ab/prop-num-stats branch February 2, 2026 15:06