[v26.1.x] iceberg: Push Parquet column stats to Iceberg manifests by vbotbuildovich · Pull Request #30777 · redpanda-data/redpanda

vbotbuildovich · 2026-06-11T23:12:24Z

Backport of PR #30704

Command: git cherry-pick -x 7f83353 b4e8232
Commits backported: 2
Conflicts resolved: 1
Commits skipped (already on target): 0
Backport branch: ai-backport-pr-30704-v26.1.x-1781219179

Conflict details

7f83353 (src/v/serde/parquet/writer.cc): only the two new file-level stat
accumulation lines (file_value_count/file_column_size_bytes) were applied;
the incoming bloom-filter block is a separate feature not present on v26.1.x
and was omitted.
7f83353 (src/v/serde/parquet/column_writer.cc): the commit was built atop a
stats-truncation feature (truncate_max/truncate_min,
max_stats_truncate_length, is_utf8_string) that does not exist on v26.1.x.
Adapted build_statistics() and flush_page() to the target's non-truncating
bound logic, dropped the redundant _flushed_stats.record_value() calls in
favor of merge(), added the _file_stats collector, and omitted the
unrelated _bloom_filter member.
7f83353 (src/v/datalake/base_types.h): merged the new includes needed by
per_column_stats (bytes, chunked_vector, serde envelope/rw); omitted
base/format_to.h since v26.1.x's local_file_metadata still uses
operator<< rather than format_to.
7f83353 (src/v/datalake/BUILD): merged the new base_types deps
(base, bytes, container:chunked_vector, serde, serde:bytes) into the target's
deps list.
7f83353 (src/v/datalake/coordinator/BUILD): kept both the target's
//src/v/base dep and the incoming //src/v/bytes dep in
iceberg_file_committer's implementation_deps.

Thread per-column stats (min/max bounds, null counts, value counts, column sizes) through to the Iceberg data_file manifest entry, where query engines use them for column-level predicate pushdown. Maintain a file-level column_stats_collector in buffered_column_writer that accumulates by merging after each flush_pages(), then use the result after the file is done to get file-level stats. (cherry picked from commit 7f83353)
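As a rough illustration of the merge-after-flush accumulation described above, here is a minimal sketch. It is not the actual `column_stats_collector` from `src/v/serde/parquet`: the `column_stats` struct, its fields, and the helpers `file_stats_from_pages` and `may_match_greater_than` are hypothetical names invented for this example, and values are restricted to nullable 64-bit integers for brevity.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical, simplified stand-in for a per-column stats collector.
// Assumption: values are nullable int64; the real collector handles
// arbitrary Parquet physical types and also tracks column sizes.
struct column_stats {
    std::optional<int64_t> min; // lower bound over non-null values
    std::optional<int64_t> max; // upper bound over non-null values
    int64_t null_count = 0;
    int64_t value_count = 0; // counts nulls as well as non-null values

    // Record one value; std::nullopt models a null.
    void record_value(std::optional<int64_t> v) {
        ++value_count;
        if (!v) {
            ++null_count;
            return;
        }
        min = min ? std::min(*min, *v) : *v;
        max = max ? std::max(*max, *v) : *v;
    }

    // Fold another collector (e.g. one flushed batch of pages) into this one.
    void merge(const column_stats& other) {
        value_count += other.value_count;
        null_count += other.null_count;
        if (other.min) {
            min = min ? std::min(*min, *other.min) : *other.min;
        }
        if (other.max) {
            max = max ? std::max(*max, *other.max) : *other.max;
        }
    }
};

using page = std::vector<std::optional<int64_t>>;

// Build stats per page, then merge each into a file-level collector,
// mirroring the "merge after each flush" idea.
column_stats file_stats_from_pages(const std::vector<page>& pages) {
    column_stats file_stats;
    for (const auto& p : pages) {
        column_stats page_stats;
        for (const auto& v : p) {
            page_stats.record_value(v);
        }
        file_stats.merge(page_stats);
    }
    return file_stats;
}

// Sketch of how an engine might use the bounds for a `col > threshold`
// predicate: the file can be skipped when this returns false.
bool may_match_greater_than(const column_stats& s, int64_t threshold) {
    return s.max.has_value() && *s.max > threshold;
}
```

Merging per-flush collectors yields the same file-level counts and bounds as recording every value into a single collector, which is presumably why per-value recording into the flushed stats could be dropped in favor of `merge()`.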

(cherry picked from commit b4e8232)

wdberkeley added 2 commits June 11, 2026 23:11

rptest/datalake: test Iceberg column stats in manifest

1533acf

(cherry picked from commit b4e8232)

vbotbuildovich added this to the v26.1.x-next milestone Jun 11, 2026

vbotbuildovich added the kind/backport PRs targeting a stable branch label Jun 11, 2026

vbotbuildovich requested a review from wdberkeley June 11, 2026 23:12

github-actions Bot added area/build area/redpanda labels Jun 11, 2026

vbotbuildovich wants to merge 2 commits into redpanda-data:v26.1.x from vbotbuildovich:ai-backport-pr-30704-v26.1.x-1781219179

2 participants
