feat(native): Implement read support for Iceberg row lineage by Joe-Abraham · Pull Request #27743 · prestodb/presto

Joe-Abraham · 2026-05-07T14:34:20Z

Description

Adds end-to-end read support for the Iceberg V3 row lineage hidden metadata columns _row_id and _last_updated_sequence_number in the Presto native execution engine.

Motivation and Context

Fixes native(Iceberg): Implement support for reading row lineage tracking columns added in V3 #27197

Impact

Adds Iceberg V3 row lineage read support in the native engine.

Test Plan

Tests added in this PR

Contributor checklist

Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
Documented new properties (with its default value), SQL syntax, functions, or other functionality.
If release notes are required, they follow the release notes guidelines.
Adequate tests were added if applicable.
CI passed.
If adding new dependencies, verified they have an OpenSSF Scorecard score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

Iceberg Connector Changes
* Add read support for row lineage columns as per the Iceberg V3 spec.

Summary by Sourcery

Add native execution engine support for reading Iceberg V3 row lineage metadata columns and validate behavior across table versions and upgrades.

New Features:

Support reading Iceberg row lineage metadata columns (_row_id and _last_updated_sequence_number) in the Presto native execution engine when requested in queries.

Enhancements:

Propagate Iceberg row lineage information from splits into Velox info columns, including conditional handling for V3 versus earlier table versions.

Build:

Add Iceberg core/api/parquet and related Hadoop/Parquet test-scoped dependencies required for native Iceberg row lineage tests.

Tests:

Add comprehensive native Iceberg row-lineage tests covering V3 tables, multiple rows per commit, and V2-to-V3 upgrade backfill semantics.
Update existing native Iceberg hidden metadata tests to explicitly use format-version 3 tables where appropriate.

sourcery-ai · 2026-05-07T14:34:55Z

Reviewer's Guide

Implements end-to-end native read support for the Iceberg V3 row lineage metadata columns _row_id and _last_updated_sequence_number in the Presto native engine by wiring lineage metadata through table handles and splits into Velox. The PR also tightens V3 semantics in existing tests, adds new Iceberg/Parquet test dependencies, and adds native integration tests that validate lineage behavior and V2→V3 upgrade scenarios against Iceberg metadata.

Sequence diagram for native Iceberg row lineage read path

sequenceDiagram
    participant PrestoEngine
    participant IcebergPrestoToVeloxConnector
    participant VeloxReader

    PrestoEngine->>IcebergPrestoToVeloxConnector: toIcebergTableHandle
    IcebergPrestoToVeloxConnector->>IcebergPrestoToVeloxConnector: fieldNamesToLowerCase
    IcebergPrestoToVeloxConnector->>IcebergPrestoToVeloxConnector: build finalDataColumns
    alt [columnHandles include _row_id or _last_updated_sequence_number]
        IcebergPrestoToVeloxConnector->>IcebergPrestoToVeloxConnector: add _row_id
        IcebergPrestoToVeloxConnector->>IcebergPrestoToVeloxConnector: add _last_updated_sequence_number
    end
    IcebergPrestoToVeloxConnector-->>PrestoEngine: ConnectorTableHandle

    PrestoEngine->>IcebergPrestoToVeloxConnector: toVeloxSplit
    IcebergPrestoToVeloxConnector->>IcebergPrestoToVeloxConnector: build infoColumns
    alt [icebergSplit->firstRowId >= 0]
        IcebergPrestoToVeloxConnector->>IcebergPrestoToVeloxConnector: set kDataSequenceNumberInfoColumn
        IcebergPrestoToVeloxConnector->>IcebergPrestoToVeloxConnector: set kFirstRowIdInfoColumn
    else [V1/V2 or missing row lineage]
        IcebergPrestoToVeloxConnector->>IcebergPrestoToVeloxConnector: omit row lineage infoColumns
    end
    IcebergPrestoToVeloxConnector-->>PrestoEngine: HiveIcebergSplit

    PrestoEngine->>VeloxReader: HiveIcebergSplit
    VeloxReader-->>PrestoEngine: _row_id, _last_updated_sequence_number
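
The split-conversion branch in the diagram can be modeled roughly as follows. This is a minimal Java sketch of the decision only, not the C++ code in `IcebergPrestoToVeloxConnector.cpp`: the class name, method name, and the two `example_` key strings are assumptions made for illustration (the real keys come from Velox constants whose string values are not shown on this page), and a negative `firstRowId` is assumed to mean "not available", as the `firstRowId >= 0` guard suggests.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RowLineageInfoColumns {
    // "$path" is the key mentioned in the review; the other two key strings are
    // hypothetical placeholders, not the actual Velox constant values.
    static final String PATH_KEY = "$path";
    static final String DATA_SEQUENCE_NUMBER_KEY = "example_data_sequence_number";
    static final String FIRST_ROW_ID_KEY = "example_first_row_id";

    // Builds the per-split info columns, adding row lineage entries only when
    // the split carries a usable firstRowId (assumed: negative means unset).
    static Map<String, String> build(String path, long firstRowId, long dataSequenceNumber) {
        Map<String, String> infoColumns = new LinkedHashMap<>();
        infoColumns.put(PATH_KEY, path);
        if (firstRowId >= 0) {
            infoColumns.put(DATA_SEQUENCE_NUMBER_KEY, Long.toString(dataSequenceNumber));
            infoColumns.put(FIRST_ROW_ID_KEY, Long.toString(firstRowId));
        }
        // Otherwise the lineage entries are omitted, as in the diagram's else branch.
        return infoColumns;
    }

    public static void main(String[] args) {
        System.out.println(build("data/f1.parquet", 100L, 7L));
        System.out.println(build("data/legacy.parquet", -1L, 3L));
    }
}
```

Note that `firstRowId == 0` is a valid value (the first file of a table), which is why the guard is `>= 0` rather than `> 0`.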

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Wire Iceberg row lineage metadata columns into the native Iceberg connector table handle and split info so Velox readers can project `_row_id` and `_last_updated_sequence_number`. | Extend Iceberg table handle construction to append row lineage metadata columns to the projected data columns when they are requested but not present in the table schema, using BIGINT types and Iceberg metadata column constants. Update Iceberg split conversion to populate Velox info columns for data sequence number and first row id only when `firstRowId` is available (V3 or backfilled manifests), otherwise defaulting to NULL lineage columns. Ensure legacy V1/V2 tables surface NULL values for lineage metadata until firstRowId/dataSequenceNumber is available, matching Iceberg spec semantics. | `presto-native-execution/presto_cpp/main/connectors/IcebergPrestoToVeloxConnector.cpp` |
| Add Iceberg and Parquet test-scoped dependencies required to create and manipulate Iceberg tables directly from native worker tests for row lineage validation. | Introduce `iceberg-core`, `iceberg-api`, and `iceberg-parquet` test dependencies aligned to the existing `${dep.iceberg.version}` with exclusions for bundled Guava, Parquet-Avro, and SLF4J clashes. Add `hadoop-apache` and `parquet-column` test dependencies (scoped to test) with minimal exclusions to support direct Iceberg HadoopCatalog and Parquet writer usage in tests. | `presto-native-execution/pom.xml` |
| Align existing native Iceberg hidden metadata tests with Iceberg V3 semantics by creating test tables as format-version 3. | Update native Iceberg metadata tests that query hidden metadata columns to create PARQUET tables with `"format-version" = '3'` so data sequence number semantics match the v3 specification. Adjust general native Iceberg hidden-column test setup to create `test_hidden_columns` as a V3 Iceberg table, preserving subsequent inserts. | `presto-native-execution/src/test/java/com/facebook/presto/nativeworker/iceberg/TestMetadata.java` `presto-native-execution/src/test/java/com/facebook/presto/nativeworker/TestPrestoNativeIcebergGeneralQueries.java` |
| Introduce comprehensive native worker integration tests that validate `_row_id` and `_last_updated_sequence_number` semantics against Iceberg metadata, including V3-only behavior and V2→V3 upgrades. | Add a new native worker test suite that boots both native and Java Iceberg query runners on a HadoopCatalog-backed warehouse, using matching storage format and catalog configuration. Implement helpers to construct Iceberg schemas, create V2/V3 Iceberg tables via the Iceberg Catalog API, and write Parquet data files using Iceberg’s `Parquet.writeData` + `GenericParquetWriter` and append them to tables. Add a test that writes two separate commits into a V3 table, computes expected `(row_id, sequence_number)` pairs from Iceberg `DataFile` metadata, and verifies Presto native results match, with uniqueness and ordering assertions on lineage values. Add a test that writes multiple rows in a single commit to a V3 table and checks that all rows share the same sequence number while still having unique row IDs, matching Iceberg’s file-level sequence semantics. Add a test that writes rows to a V2 table, verifies both lineage columns are NULL, upgrades the table to V3 (triggering firstRowId backfill), then writes more rows and validates that all rows (pre- and post-upgrade) have non-null lineage that exactly matches Iceberg metadata. Provide utility methods for loading the Iceberg Hadoop catalog, resolving the warehouse directory from the distributed query runner, and building the expected lineage pairs from planned file scans. | `presto-native-execution/src/test/java/com/facebook/presto/nativeworker/TestPrestoNativeIcebergRowLineage.java` |
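
To make the "expected lineage pairs" idea concrete, here is a small, self-contained Java sketch of how such pairs can be derived from per-file metadata. It assumes the inheritance rule from the Iceberg V3 spec for rows whose lineage fields are not materialized in the data file: `_row_id` is the file's `first_row_id` plus the row's position in the file, and `_last_updated_sequence_number` is the file's data sequence number. `FileLineage`, `LineagePair`, and `expectedPairs` are hypothetical names used only for this illustration; the actual test reads these values from Iceberg `DataFile` metadata.

```java
import java.util.ArrayList;
import java.util.List;

final class ExpectedRowLineage {
    /** Hypothetical stand-in for the per-file metadata a test would read from Iceberg. */
    static final class FileLineage {
        final Long firstRowId; // null when the file has no first_row_id assigned
        final long dataSequenceNumber;
        final long recordCount;

        FileLineage(Long firstRowId, long dataSequenceNumber, long recordCount) {
            this.firstRowId = firstRowId;
            this.dataSequenceNumber = dataSequenceNumber;
            this.recordCount = recordCount;
        }
    }

    /** One expected (_row_id, _last_updated_sequence_number) pair. */
    static final class LineagePair {
        final Long rowId; // null when the row id cannot be inherited
        final long sequenceNumber;

        LineagePair(Long rowId, long sequenceNumber) {
            this.rowId = rowId;
            this.sequenceNumber = sequenceNumber;
        }
    }

    // Expands each file into one pair per row: row id = first_row_id + position,
    // sequence number = the file's data sequence number.
    static List<LineagePair> expectedPairs(List<FileLineage> files) {
        List<LineagePair> pairs = new ArrayList<>();
        for (FileLineage file : files) {
            for (long pos = 0; pos < file.recordCount; pos++) {
                Long rowId = file.firstRowId == null ? null : Long.valueOf(file.firstRowId + pos);
                pairs.add(new LineagePair(rowId, file.dataSequenceNumber));
            }
        }
        return pairs;
    }
}
```

With two files, one holding two rows at sequence number 1 and one holding a single row at sequence number 2, this yields row IDs 0, 1, 2: rows from the same file share a sequence number but keep distinct row IDs, which is the property the multi-row test checks.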

Possibly linked issues

native(Iceberg): Implement support for reading row lineage tracking columns added in V3 #27197: The PR implements native read support for Iceberg V3 row lineage columns exactly as requested in the issue.
#: PR implements the native row lineage read support explicitly listed under this Iceberg V3 tracking issue.

Joe-Abraham · 2026-05-11T16:33:08Z

@sourcery-ai review

sourcery-ai

Hey - I've found 2 issues, and left some high level feedback:

In toIcebergTableHandle, the iteration over columnHandles with std::find on the names vector is O(n^2); consider tracking names in a std::unordered_set (or similar) to avoid repeated linear scans when more columns are involved.
The test class TestPrestoNativeIcebergRowLineage repeats logic to build expectedPairs from FileScanTask/DataFile in multiple tests; consider extracting this into a helper method to reduce duplication and keep the assertions focused on behavior differences.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- In `toIcebergTableHandle`, the iteration over `columnHandles` with `std::find` on the `names` vector is O(n^2); consider tracking names in a `std::unordered_set` (or similar) to avoid repeated linear scans when more columns are involved.
- The test class `TestPrestoNativeIcebergRowLineage` repeats logic to build `expectedPairs` from `FileScanTask`/`DataFile` in multiple tests; consider extracting this into a helper method to reduce duplication and keep the assertions focused on behavior differences.

## Individual Comments

### Comment 1
<location path="presto-native-execution/pom.xml" line_range="249-251" />
<code_context>
             </exclusions>
         </dependency>

+        <dependency>
+            <groupId>org.apache.iceberg</groupId>
+            <artifactId>iceberg-core</artifactId>
+            <version>${dep.iceberg.version}</version>
+            <scope>test</scope>
</code_context>
<issue_to_address>
**issue (bug_risk):** Consider excluding `iceberg-bundled-guava` from `iceberg-core` as well to avoid guava conflicts

`iceberg-core` also brings in `iceberg-bundled-guava`, so leaving it unexcluded can cause classpath conflicts with the rest of the stack. Please apply the same exclusion here so tests use the shared Guava version instead of the shaded one.
</issue_to_address>

### Comment 2
<location path="presto-native-execution/src/test/java/com/facebook/presto/nativeworker/TestPrestoNativeIcebergRowLineage.java" line_range="248-257" />
<code_context>
+            // Per the Iceberg spec, _row_id is null for pre-V3 snapshots (first_row_id unset),
</code_context>
<issue_to_address>
**suggestion (testing):** Strengthen pre-upgrade V2 assertions by checking `_last_updated_sequence_number` against Iceberg metadata

In `testV2TableRowLineageNullThenBackfilledAfterV3Upgrade`, the pre-upgrade checks cover `_row_id` being NULL and `_last_updated_sequence_number` being non-null, but they don’t validate that these sequence values match the Iceberg `dataSequenceNumber` per data file. To make this a more robust end-to-end test, please also capture the pre-upgrade `dataSequenceNumber` for each file and assert that the values returned by Presto for `_last_updated_sequence_number` align with those, ensuring the native engine correctly inherits sequence numbers from split metadata when `firstRowId` is unset.

Suggested implementation:

```java
            // Per the Iceberg spec, _row_id is null for pre-V3 snapshots (first_row_id unset),
            // but _last_updated_sequence_number is always inherited from the data sequence number.
            assertQuery("SELECT \"_row_id\", \"_last_updated_sequence_number\", * FROM " + tableName);
            assertEquals(computeActual("SELECT \"_row_id\", * FROM " + tableName).getRowCount(), 3);
            assertQuery("SELECT \"_row_id\" FROM " + tableName + " ORDER BY id", "VALUES NULL, NULL, NULL");
            assertEquals(
                    computeActual("SELECT count(*) FROM " + tableName +
                            " WHERE \"_last_updated_sequence_number\" IS NULL").getOnlyValue(),
                    0L,
                    "_last_updated_sequence_number must be non-null for V2 rows (inherited from data sequence number)");

            // Strengthen the pre-upgrade V2 assertions by verifying that the values returned by Presto for
            // _last_updated_sequence_number match the Iceberg dataSequenceNumber values for the underlying data files.
            //
            // In this test:
            //  - The first write produces a file containing rows with id = 1, 2
            //  - The second write produces a file containing the row with id = 3
            //
            // For V2 tables, the dataSequenceNumber is stored per data file and should be inherited by all rows
            // read from that file when firstRowId is unset (pre-V3 snapshots).
            Snapshot snapshot = table.currentSnapshot();
            List<Long> dataSequenceNumbers = Streams.stream(SnapshotUtil.snapshotDataFiles(table, snapshot.snapshotId()))
                    .map(DataFile::dataSequenceNumber)
                    .distinct()
                    .sorted()
                    .collect(toImmutableList());

            assertEquals(
                    dataSequenceNumbers.size(),
                    2,
                    "Expected two distinct data sequence numbers for the two data files written in the test");

            MaterializedResult v2Rows = computeActual(
                    "SELECT id, \"_last_updated_sequence_number\" FROM " + tableName + " ORDER BY id");

            List<Long> lastUpdatedSequenceNumbers = v2Rows.getMaterializedRows().stream()
                    .map(row -> ((Number) row.getField(1)).longValue())
                    .collect(toImmutableList());

            assertEquals(lastUpdatedSequenceNumbers.size(), 3, "Expected three V2 rows");

            // Rows with id = 1 and 2 come from the first file, and id = 3 from the second file. Validate that
            // the _last_updated_sequence_number exported by the native engine matches the Iceberg dataSequenceNumber
            // for the corresponding data file.
            assertEquals(
                    lastUpdatedSequenceNumbers.get(0),
                    dataSequenceNumbers.get(0),
                    "Row with id = 1 should inherit the dataSequenceNumber of the first data file");
            assertEquals(
                    lastUpdatedSequenceNumbers.get(1),
                    dataSequenceNumbers.get(0),
                    "Row with id = 2 should inherit the dataSequenceNumber of the first data file");
            assertEquals(
                    lastUpdatedSequenceNumbers.get(2),
                    dataSequenceNumbers.get(1),
                    "Row with id = 3 should inherit the dataSequenceNumber of the second data file");

```

To compile successfully, the following imports (or their equivalents, depending on what is already used in this test class) need to be added at the top of `TestPrestoNativeIcebergRowLineage.java`:

1. Import Iceberg metadata utilities and data file type:
   - `import org.apache.iceberg.Snapshot;`
   - `import org.apache.iceberg.util.SnapshotUtil;`
   - `import org.apache.iceberg.DataFile;`

2. Import streaming/collection utilities:
   - `import java.util.List;`
   - `import com.google.common.collect.Streams;`
   - `import static com.google.common.collect.ImmutableList.toImmutableList;`

If the file already uses `MaterializedResult` and `toImmutableList`, re-use the existing imports instead of duplicating them.
</issue_to_address>

Joe-Abraham · 2026-05-12T05:51:20Z

@sourcery-ai review

Joe-Abraham · 2026-05-12T05:51:42Z

@sourcery-ai summary

Joe-Abraham · 2026-05-12T05:51:55Z

@sourcery-ai guide

sourcery-ai

Hey - I've left some high level feedback:

In IcebergPrestoToVeloxConnector::toVeloxSplit, now that the data sequence and first row id info column keys use IcebergMetadataColumn constants, consider also replacing the hardcoded "$path" key with the corresponding constant (if available) to keep these info column names centralized and less error-prone.
The repeated use of the literal "format-version" and string values '2'/'3' across the new and modified tests could be refactored into shared constants or helper methods to reduce duplication and make future Iceberg version-related changes easier.
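
One way the second suggestion could look is sketched below. This is an illustration under assumptions, not code from the PR: `IcebergFormatVersions`, its constants, and `createTableSql` are hypothetical names, and the generated statement only covers the `"format-version"` table property that the tests are described as setting.

```java
final class IcebergFormatVersions {
    // Property name as it appears in the tests' CREATE TABLE statements.
    static final String FORMAT_VERSION_PROPERTY = "format-version";
    static final int V2 = 2;
    static final int V3 = 3;

    private IcebergFormatVersions() {}

    // Builds a CREATE TABLE statement with the given Iceberg format version,
    // so tests do not repeat the quoted property name and version literals.
    static String createTableSql(String tableName, String columnDefinitions, int formatVersion) {
        return String.format(
                "CREATE TABLE %s (%s) WITH (\"%s\" = '%d')",
                tableName,
                columnDefinitions,
                FORMAT_VERSION_PROPERTY,
                formatVersion);
    }

    public static void main(String[] args) {
        System.out.println(createTableSql("test_hidden_columns", "id BIGINT", V3));
    }
}
```

A test would then call `createTableSql(tableName, "id BIGINT", IcebergFormatVersions.V2)` for the pre-upgrade table and use `V3` elsewhere, keeping the version literals in one place.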

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- In `IcebergPrestoToVeloxConnector::toVeloxSplit`, now that the data sequence and first row id info column keys use `IcebergMetadataColumn` constants, consider also replacing the hardcoded `"$path"` key with the corresponding constant (if available) to keep these info column names centralized and less error-prone.
- The repeated use of the literal `"format-version"` and string values `'2'`/`'3'` across the new and modified tests could be refactored into shared constants or helper methods to reduce duplication and make future Iceberg version-related changes easier.

prestodb-ci added the from:IBM PR from IBM label May 7, 2026

Joe-Abraham force-pushed the row_lineage_c++ branch 2 times, most recently from 6791f23 to 1240978 Compare May 11, 2026 06:16

Joe-Abraham changed the title ~~feat: Implement read support for Iceberg row lineage~~ feat(native): Implement read support for Iceberg row lineage May 11, 2026

chore: Update velox

87fe850

Joe-Abraham force-pushed the row_lineage_c++ branch from 2fe0691 to fe35fd2 Compare May 11, 2026 08:56

Joe-Abraham force-pushed the row_lineage_c++ branch from fe35fd2 to 1b960c1 Compare May 11, 2026 09:13

Joe-Abraham mentioned this pull request May 11, 2026

[native]: Iceberg V3 support #27198

Open

27 tasks

Joe-Abraham force-pushed the row_lineage_c++ branch 3 times, most recently from f0c9da2 to 996b9c2 Compare May 11, 2026 09:24

prestodb deleted a comment from sourcery-ai Bot May 11, 2026

Joe-Abraham force-pushed the row_lineage_c++ branch 2 times, most recently from f6035a7 to b1f349f Compare May 11, 2026 15:41

sourcery-ai Bot reviewed May 11, 2026

Comment thread presto-native-execution/pom.xml

Comment thread ...cution/src/test/java/com/facebook/presto/nativeworker/TestPrestoNativeIcebergRowLineage.java Outdated

feat: Implement row lineage support for Iceberg

6861fb9

Joe-Abraham force-pushed the row_lineage_c++ branch from b1f349f to 6861fb9 Compare May 12, 2026 05:26

sourcery-ai Bot reviewed May 12, 2026

Joe-Abraham marked this pull request as ready for review May 12, 2026 07:47

Joe-Abraham requested review from a team as code owners May 12, 2026 07:47

prestodb-ci requested review from a team and removed request for a team May 12, 2026 07:47

prestodb-ci requested review from NivinCS and jp-sivaprasad May 12, 2026 07:47

Joe-Abraham requested a review from nmahadevuni May 12, 2026 07:48

Joe-Abraham wants to merge 2 commits into prestodb:master from Joe-Abraham:row_lineage_c++

2 participants
