feat: propagate property numerical statistics #204
Merged
AndreaBozzo merged 2 commits into master on Feb 2, 2026
…ut, remaining work ported into issue
Pull request overview
This PR propagates advanced numerical statistics (median, quartiles, skewness, kurtosis, coefficient of variation, mode, variance) from the core calculate_numeric_stats function to all profiling engines (Arrow columnar, streaming incremental, and streaming mapped). The changes also expose these statistics in Python bindings and update HTML/JSON output rendering.
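As a rough illustration of the statistics listed in the overview, here is a minimal, std-only sketch of a single shared routine that computes them in one place. It is not the project's calculate_numeric_stats: the struct, the function name, the field names, the use of population moments, excess kurtosis, and linear-interpolation quartiles are all assumptions made for the example.

```rust
/// Hypothetical result type; field names are illustrative, not dataprof's own.
#[derive(Debug, Clone, PartialEq)]
pub struct NumericStatsSketch {
    pub mean: f64,
    pub variance: f64,
    pub median: f64,
    pub q1: f64,
    pub q3: f64,
    pub skewness: f64,
    pub kurtosis: f64,
    pub coefficient_of_variation: Option<f64>,
    pub mode: Option<f64>,
}

/// Linear-interpolation quantile over a sorted, non-empty slice (assumed method).
fn quantile(sorted: &[f64], p: f64) -> f64 {
    let pos = p * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo as f64)
}

/// Computes all statistics in one call; returns None for empty or non-finite input.
pub fn numeric_stats_sketch(values: &[f64]) -> Option<NumericStatsSketch> {
    if values.is_empty() || values.iter().any(|v| !v.is_finite()) {
        return None;
    }
    let n = values.len() as f64;
    let mean = values.iter().sum::<f64>() / n;

    // Population central moments (dividing by n is an assumption of this sketch).
    let moment = |k: i32| values.iter().map(|v| (v - mean).powi(k)).sum::<f64>() / n;
    let (m2, m3, m4) = (moment(2), moment(3), moment(4));

    // Skewness and excess kurtosis are reported as 0 for constant data.
    let (skewness, kurtosis) = if m2 > 0.0 {
        (m3 / m2.powf(1.5), m4 / (m2 * m2) - 3.0)
    } else {
        (0.0, 0.0)
    };

    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));

    // Mode: value with the longest run in sorted order; None if all values are unique.
    // Ties resolve to the smallest such value.
    let (mut mode, mut best, mut run) = (None, 1usize, 1usize);
    for i in 1..sorted.len() {
        if sorted[i] == sorted[i - 1] {
            run += 1;
            if run > best {
                best = run;
                mode = Some(sorted[i]);
            }
        } else {
            run = 1;
        }
    }

    Some(NumericStatsSketch {
        mean,
        variance: m2,
        median: quantile(&sorted, 0.5),
        q1: quantile(&sorted, 0.25),
        q3: quantile(&sorted, 0.75),
        skewness,
        kurtosis,
        // CV is undefined when the mean is zero.
        coefficient_of_variation: if mean != 0.0 { Some(m2.sqrt() / mean.abs()) } else { None },
        mode,
    })
}
```

Keeping every derived statistic behind one function like this is what lets several engines report identical fields without each one reimplementing the arithmetic.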
Changes:
- Refactored streaming engines (incremental and mapped) to delegate numeric statistics calculation to the centralized calculate_numeric_stats function instead of computing them inline
- Updated Arrow columnar engines to use calculate_numeric_stats and increased sample size cap from 100 to 10,000 to match the threshold used by the stats module
- Extended Python bindings (PyColumnProfile) to expose all new numeric statistics fields
- Enhanced HTML templates and Python HTML generation to display advanced numeric statistics
- Added comprehensive test coverage for the new statistics in streaming and columnar engines
- Increased max_sample_size in StreamingStatistics from 1,000 to 10,000 to align with the numeric stats module
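The delegation pattern described in the list above can be sketched roughly as follows: each engine collects values into a capped sample and hands that sample to one shared function, so results agree across engines. Everything here is hypothetical. BoundedSample, shared_mean, profile_incremental and profile_columnar are illustrative names, the "keep the first N values" retention policy is an assumption (the PR only states the 10,000 cap), and a plain mean stands in for the full statistics routine.

```rust
/// Cap value taken from the PR description; the constant name is illustrative.
pub const MAX_SAMPLE_SIZE: usize = 10_000;

/// Hypothetical capped buffer: keeps the first `max_sample_size` values (assumed policy).
pub struct BoundedSample {
    values: Vec<f64>,
    max_sample_size: usize,
    seen: usize,
}

impl BoundedSample {
    pub fn new(max_sample_size: usize) -> Self {
        Self { values: Vec::new(), max_sample_size, seen: 0 }
    }

    /// Records one value, retaining it only while the buffer is below the cap.
    pub fn update(&mut self, value: f64) {
        self.seen += 1;
        if self.values.len() < self.max_sample_size {
            self.values.push(value);
        }
    }

    /// Total number of values observed, including those not retained.
    pub fn seen(&self) -> usize {
        self.seen
    }

    /// The retained sample.
    pub fn sample(&self) -> &[f64] {
        &self.values
    }
}

/// Stand-in for the centralized statistics function (here: just the mean).
pub fn shared_mean(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        None
    } else {
        Some(values.iter().sum::<f64>() / values.len() as f64)
    }
}

/// Streaming-style engine: consumes chunks, then delegates to the shared function.
pub fn profile_incremental(chunks: &[Vec<f64>]) -> Option<f64> {
    let mut sample = BoundedSample::new(MAX_SAMPLE_SIZE);
    for chunk in chunks {
        for v in chunk {
            sample.update(*v);
        }
    }
    shared_mean(sample.sample())
}

/// Columnar-style engine: consumes a whole column, then delegates to the same function.
pub fn profile_columnar(column: &[f64]) -> Option<f64> {
    let mut sample = BoundedSample::new(MAX_SAMPLE_SIZE);
    for v in column {
        sample.update(*v);
    }
    shared_mean(sample.sample())
}
```

Because both entry points use the same cap and the same downstream function, the same data yields the same statistics whether it arrives in chunks or as a single column, which is the consistency the aligned 10,000 threshold appears intended to give.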
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/python/types.rs | Added numeric statistics fields to PyColumnProfile and updated JSON serialization |
| src/output/html.rs | Updated format_column_stats_json to include all new numeric statistics fields |
| templates/single_report.hbs | Added HTML rendering for advanced numeric statistics (median, skewness, kurtosis, CV) |
| src/engines/streaming/mapped.rs | Refactored to use calculate_numeric_stats instead of inline calculations |
| src/engines/streaming/incremental.rs | Refactored to use calculate_numeric_stats and added test coverage |
| src/engines/columnar/record_batch_analyzer.rs | Updated to use calculate_numeric_stats and increased sample cap to 10,000 |
| src/engines/columnar/arrow_profiler.rs | Updated to use calculate_numeric_stats and increased sample cap to 10,000 |
| src/core/streaming_stats.rs | Increased max_sample_size from 1,000 to 10,000 |
| python/dataprof/__init__.py | Added numeric statistics rendering in both JSON and HTML export methods |
propagation of numerical statistics properties.
remaining work ported into issue #203