[v26.1.x] iceberg: Push Parquet column stats to Iceberg manifests #30777

Open
vbotbuildovich wants to merge 2 commits into redpanda-data:v26.1.x from vbotbuildovich:ai-backport-pr-30704-v26.1.x-1781219179

@vbotbuildovich (Collaborator)

Backport of PR #30704

  • Command: git cherry-pick -x 7f83353 b4e8232
  • Commits backported: 2
  • Conflicts resolved: 1
  • Commits skipped (already on target): 0
  • Backport branch: ai-backport-pr-30704-v26.1.x-1781219179

Conflict details

  • 7f83353 (src/v/serde/parquet/writer.cc): only the two new file-level stat
    accumulation lines (file_value_count/file_column_size_bytes) were applied;
    the incoming bloom-filter block is a separate feature not present on v26.1.x
    and was omitted.
  • 7f83353 (src/v/serde/parquet/column_writer.cc): the commit was built atop a
    stats-truncation feature (truncate_max/truncate_min,
    max_stats_truncate_length, is_utf8_string) that does not exist on v26.1.x.
    Adapted build_statistics() and flush_page() to the target's non-truncating
    bound logic, dropped the redundant _flushed_stats.record_value() calls in
    favor of merge(), added the _file_stats collector, and omitted the
    unrelated _bloom_filter member.
  • 7f83353 (src/v/datalake/base_types.h): merged the new includes needed by
    per_column_stats (bytes, chunked_vector, serde envelope/rw); omitted
    base/format_to.h since v26.1.x's local_file_metadata still uses
    operator<< rather than format_to.
  • 7f83353 (src/v/datalake/BUILD): merged the new base_types deps
    (base, bytes, container:chunked_vector, serde, serde:bytes) into the target's
    deps list.
  • 7f83353 (src/v/datalake/coordinator/BUILD): kept both the target's
    //src/v/base dep and the incoming //src/v/bytes dep in
    iceberg_file_committer's implementation_deps.

Thread per-column stats (min/max bounds, null counts, value counts,
column sizes) through to the Iceberg data_file manifest entry, where
query engines use them for column-level predicate pushdown.
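As a rough illustration of why these bounds matter to a reader of the manifest, the sketch below shows how an engine could skip a data file using only file-level lower/upper bounds. The names `column_bounds`, `may_contain_eq`, and `may_contain_gt` are hypothetical, the column is assumed to be a plain 64-bit integer, and none of this is Redpanda or Iceberg library code.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical file-level bounds for one integer column, as a query engine
// might read them from a manifest entry. A missing bound means "unknown".
struct column_bounds {
    std::optional<int64_t> lower;
    std::optional<int64_t> upper;
};

// Returns false only when the bounds prove that no row can satisfy
// `column == target`; an unknown bound never allows skipping.
inline bool may_contain_eq(const column_bounds& b, int64_t target) {
    assert(!b.lower || !b.upper || *b.lower <= *b.upper);
    if (b.lower && target < *b.lower) {
        return false;
    }
    if (b.upper && target > *b.upper) {
        return false;
    }
    return true;
}

// Returns false only when the bounds prove that no row can satisfy
// `column > threshold`, i.e. when the upper bound is at or below it.
inline bool may_contain_gt(const column_bounds& b, int64_t threshold) {
    if (b.upper && *b.upper <= threshold) {
        return false;
    }
    return true;
}
```

With bounds of [10, 20], a filter such as `column == 5` or `column > 20` lets the file be skipped without opening it, while a file with no recorded bounds must always be read.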

Maintain a file-level column_stats_collector in buffered_column_writer
that accumulates by merging after each flush_pages(), then use the result
after the file is done to get file-level stats.
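The accumulate-by-merge approach described above can be sketched roughly as follows. This is a simplified stand-in, not the actual `column_stats_collector` or `buffered_column_writer`: the types `column_stats` and `toy_column_writer` are hypothetical, values are assumed to be 64-bit integers, and the value count is assumed to include nulls.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical per-column stats for an integer column.
struct column_stats {
    int64_t value_count = 0; // assumed to include nulls
    int64_t null_count = 0;
    std::optional<int64_t> min;
    std::optional<int64_t> max;

    // Record a single value; std::nullopt models a null.
    void record_value(std::optional<int64_t> v) {
        ++value_count;
        if (!v) {
            ++null_count;
            return;
        }
        min = min ? std::min(*min, *v) : *v;
        max = max ? std::max(*max, *v) : *v;
    }

    // Fold another collector's totals and bounds into this one.
    void merge(const column_stats& other) {
        value_count += other.value_count;
        null_count += other.null_count;
        if (other.min) {
            min = min ? std::min(*min, *other.min) : *other.min;
        }
        if (other.max) {
            max = max ? std::max(*max, *other.max) : *other.max;
        }
    }
};

// Hypothetical writer: page-level stats are merged into the file-level
// collector once per flush instead of recording every value twice.
struct toy_column_writer {
    column_stats page_stats;
    column_stats file_stats;

    void add(std::optional<int64_t> v) { page_stats.record_value(v); }

    void flush_page() {
        assert(page_stats.null_count <= page_stats.value_count);
        file_stats.merge(page_stats);
        page_stats = column_stats{};
    }
};
```

Once the last page has been flushed, `file_stats` holds the file-level counts and bounds. Merging once per flush keeps the per-value path to a single `record_value()` call.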

(cherry picked from commit 7f83353)

@vbotbuildovich added this to the v26.1.x-next milestone Jun 11, 2026
@vbotbuildovich added the kind/backport label Jun 11, 2026

Labels: area/build, area/redpanda, kind/backport
