
feat(native): Implement read support for Iceberg row lineage #27743

Open
Joe-Abraham wants to merge 2 commits into prestodb:master from Joe-Abraham:row_lineage_c++

Joe-Abraham (Contributor) commented May 7, 2026

Description

Adds end-to-end read support for the Iceberg V3 row lineage hidden metadata columns _row_id and _last_updated_sequence_number in the Presto native execution engine.

Motivation and Context

Fixes

Impact

Adds Iceberg V3 row lineage read support in the native engine.

Test Plan

Tests added in this PR

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with their default values), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.
  • If adding new dependencies, verified they have an OpenSSF Scorecard score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

Iceberg Connector Changes
* Add read support for row lineage columns as per the Iceberg V3 spec.

Summary by Sourcery

Add native execution engine support for reading Iceberg V3 row lineage metadata columns and validate behavior across table versions and upgrades.

New Features:

  • Support reading Iceberg row lineage metadata columns (_row_id and _last_updated_sequence_number) in the Presto native execution engine when requested in queries.

Enhancements:

  • Propagate Iceberg row lineage information from splits into Velox info columns, including conditional handling for V3 versus earlier table versions.

Build:

  • Add Iceberg core/api/parquet and related Hadoop/Parquet test-scoped dependencies required for native Iceberg row lineage tests.

Tests:

  • Add comprehensive native Iceberg row-lineage tests covering V3 tables, multiple rows per commit, and V2-to-V3 upgrade backfill semantics.
  • Update existing native Iceberg hidden metadata tests to explicitly use format-version 3 tables where appropriate.

prestodb-ci added the from:IBM label May 7, 2026

sourcery-ai Bot commented May 7, 2026

Reviewer's Guide

Implements end-to-end native read support for the Iceberg V3 row lineage metadata columns _row_id and _last_updated_sequence_number in the Presto native engine. Lineage metadata is wired through table handles and splits into Velox, and existing tests are tightened to V3 semantics. New Iceberg/Parquet test dependencies and native integration tests validate lineage behavior and V2→V3 upgrade scenarios against Iceberg metadata.

Sequence diagram for native Iceberg row lineage read path

sequenceDiagram
    participant PrestoEngine
    participant IcebergPrestoToVeloxConnector
    participant VeloxReader

    PrestoEngine->>IcebergPrestoToVeloxConnector: toIcebergTableHandle
    IcebergPrestoToVeloxConnector->>IcebergPrestoToVeloxConnector: fieldNamesToLowerCase
    IcebergPrestoToVeloxConnector->>IcebergPrestoToVeloxConnector: build finalDataColumns
    alt [columnHandles include _row_id or _last_updated_sequence_number]
        IcebergPrestoToVeloxConnector->>IcebergPrestoToVeloxConnector: add _row_id
        IcebergPrestoToVeloxConnector->>IcebergPrestoToVeloxConnector: add _last_updated_sequence_number
    end
    IcebergPrestoToVeloxConnector-->>PrestoEngine: ConnectorTableHandle

    PrestoEngine->>IcebergPrestoToVeloxConnector: toVeloxSplit
    IcebergPrestoToVeloxConnector->>IcebergPrestoToVeloxConnector: build infoColumns
    alt [icebergSplit->firstRowId >= 0]
        IcebergPrestoToVeloxConnector->>IcebergPrestoToVeloxConnector: set kDataSequenceNumberInfoColumn
        IcebergPrestoToVeloxConnector->>IcebergPrestoToVeloxConnector: set kFirstRowIdInfoColumn
    else [V1/V2 or missing row lineage]
        IcebergPrestoToVeloxConnector->>IcebergPrestoToVeloxConnector: omit row lineage infoColumns
    end
    IcebergPrestoToVeloxConnector-->>PrestoEngine: HiveIcebergSplit

    PrestoEngine->>VeloxReader: HiveIcebergSplit
    VeloxReader-->>PrestoEngine: _row_id, _last_updated_sequence_number
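
The toVeloxSplit branch in the diagram can be sketched roughly as follows. This is a minimal illustration and not the PR's code: SplitLineage, buildInfoColumns, and the two lineage key strings are hypothetical stand-ins for the Iceberg split fields and the IcebergMetadataColumn constants. A negative firstRowId is assumed to mean "not available", as the firstRowId >= 0 guard suggests.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the lineage fields carried by an Iceberg split.
struct SplitLineage {
  int64_t firstRowId;          // assumed negative when unavailable (V1/V2)
  int64_t dataSequenceNumber;
};

// Hypothetical key names; the real code is said to use
// IcebergMetadataColumn constants instead.
const std::string kFirstRowIdKey = "$first_row_id";
const std::string kDataSequenceNumberKey = "$data_sequence_number";

using InfoColumns = std::unordered_map<std::string, std::string>;

// Builds the info columns for one split. The lineage keys are only set
// when firstRowId is available; otherwise they are omitted.
InfoColumns buildInfoColumns(
    const std::string& path,
    const SplitLineage& lineage) {
  assert(!path.empty());
  InfoColumns info;
  info["$path"] = path;
  if (lineage.firstRowId >= 0) {
    info[kDataSequenceNumberKey] = std::to_string(lineage.dataSequenceNumber);
    info[kFirstRowIdKey] = std::to_string(lineage.firstRowId);
  }
  return info;
}
```

In this sketch a split without firstRowId carries only the path entry, which corresponds to the "omit row lineage infoColumns" branch of the diagram.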

File-Level Changes

Change: Wire Iceberg row lineage metadata columns into the native Iceberg connector table handle and split info so Velox readers can project _row_id and _last_updated_sequence_number.
Details:
  • Extend Iceberg table handle construction to append row lineage metadata columns to the projected data columns when they are requested but not present in the table schema, using BIGINT types and Iceberg metadata column constants.
  • Update Iceberg split conversion to populate Velox info columns for data sequence number and first row id only when firstRowId is available (V3 or backfilled manifests), otherwise defaulting to NULL lineage columns.
  • Ensure legacy V1/V2 tables surface NULL values for lineage metadata until firstRowId/dataSequenceNumber is available, matching Iceberg spec semantics.
Files:
  • presto-native-execution/presto_cpp/main/connectors/IcebergPrestoToVeloxConnector.cpp

Change: Add Iceberg and Parquet test-scoped dependencies required to create and manipulate Iceberg tables directly from native worker tests for row lineage validation.
Details:
  • Introduce iceberg-core, iceberg-api, and iceberg-parquet test dependencies aligned to the existing ${dep.iceberg.version} with exclusions for bundled Guava, Parquet-Avro, and SLF4J clashes.
  • Add hadoop-apache and parquet-column test dependencies (scoped to test) with minimal exclusions to support direct Iceberg HadoopCatalog and Parquet writer usage in tests.
Files:
  • presto-native-execution/pom.xml

Change: Align existing native Iceberg hidden metadata tests with Iceberg V3 semantics by creating test tables as format-version 3.
Details:
  • Update native Iceberg metadata tests that query hidden metadata columns to create PARQUET tables with "format-version" = '3' so data sequence number semantics match the v3 specification.
  • Adjust general native Iceberg hidden-column test setup to create test_hidden_columns as a V3 Iceberg table, preserving subsequent inserts.
Files:
  • presto-native-execution/src/test/java/com/facebook/presto/nativeworker/iceberg/TestMetadata.java
  • presto-native-execution/src/test/java/com/facebook/presto/nativeworker/TestPrestoNativeIcebergGeneralQueries.java

Change: Introduce comprehensive native worker integration tests that validate _row_id and _last_updated_sequence_number semantics against Iceberg metadata, including V3-only behavior and V2→V3 upgrades.
Details:
  • Add a new native worker test suite that boots both native and Java Iceberg query runners on a HadoopCatalog-backed warehouse, using matching storage format and catalog configuration.
  • Implement helpers to construct Iceberg schemas, create V2/V3 Iceberg tables via the Iceberg Catalog API, and write Parquet data files using Iceberg’s Parquet.writeData + GenericParquetWriter and append them to tables.
  • Add a test that writes two separate commits into a V3 table, computes expected (row_id, sequence_number) pairs from Iceberg DataFile metadata, and verifies Presto native results match, with uniqueness and ordering assertions on lineage values.
  • Add a test that writes multiple rows in a single commit to a V3 table and checks that all rows share the same sequence number while still having unique row IDs, matching Iceberg’s file-level sequence semantics.
  • Add a test that writes rows to a V2 table, verifies both lineage columns are NULL, upgrades the table to V3 (triggering firstRowId backfill), then writes more rows and validates that all rows (pre- and post-upgrade) have non-null lineage that exactly matches Iceberg metadata.
  • Provide utility methods for loading the Iceberg Hadoop catalog, resolving the warehouse directory from the distributed query runner, and building the expected lineage pairs from planned file scans.
Files:
  • presto-native-execution/src/test/java/com/facebook/presto/nativeworker/TestPrestoNativeIcebergRowLineage.java
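
The semantics these tests check can be summarised in a few lines. Under the Iceberg V3 spec, a row whose lineage fields are not stored in the data file inherits _row_id as the file's first_row_id plus the row's position in the file, and _last_updated_sequence_number from the file's data sequence number. The sketch below is illustrative only: FileLineage, RowLineage, and deriveLineage are hypothetical names, and it follows this guide in treating both columns as NULL when first_row_id is absent (a later revision of the PR may handle the sequence number differently).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical per-file metadata, as it would come from an Iceberg DataFile.
struct FileLineage {
  std::optional<int64_t> firstRowId;  // unset for files without row lineage
  int64_t dataSequenceNumber;
  int64_t rowCount;
};

// Lineage values for one row; an empty optional models SQL NULL.
struct RowLineage {
  std::optional<int64_t> rowId;
  std::optional<int64_t> lastUpdatedSequenceNumber;
};

// Derives the expected lineage of every row in a file, in position order.
std::vector<RowLineage> deriveLineage(const FileLineage& file) {
  assert(file.rowCount >= 0);
  std::vector<RowLineage> rows;
  rows.reserve(static_cast<std::size_t>(file.rowCount));
  for (int64_t pos = 0; pos < file.rowCount; ++pos) {
    RowLineage row;
    if (file.firstRowId.has_value()) {
      // Row ids are unique within the file; the sequence number is shared.
      row.rowId = *file.firstRowId + pos;
      row.lastUpdatedSequenceNumber = file.dataSequenceNumber;
    }
    rows.push_back(row);
  }
  return rows;
}
```

For a file with first_row_id 100, data sequence number 5, and three rows, this yields row ids 100, 101, and 102, all with sequence number 5, which is the "same sequence number, unique row IDs" property the single-commit test asserts.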

Joe-Abraham force-pushed the row_lineage_c++ branch 2 times, most recently from 6791f23 to 1240978 May 11, 2026 06:16
Joe-Abraham changed the title from "feat: Implement read support for Iceberg row lineage" to "feat(native): Implement read support for Iceberg row lineage" May 11, 2026
sourcery-ai[bot]

This comment was marked as resolved.

sourcery-ai[bot]

This comment was marked as resolved.

@Joe-Abraham Joe-Abraham mentioned this pull request May 11, 2026
27 tasks
Joe-Abraham force-pushed the row_lineage_c++ branch 3 times, most recently from f0c9da2 to 996b9c2 May 11, 2026 09:24
prestodb deleted a comment from sourcery-ai Bot May 11, 2026
Joe-Abraham force-pushed the row_lineage_c++ branch 2 times, most recently from f6035a7 to b1f349f May 11, 2026 15:41
@Joe-Abraham
Contributor Author

@sourcery-ai review

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've found 2 issues, and left some high level feedback:

  • In toIcebergTableHandle, the iteration over columnHandles with std::find on the names vector is O(n^2); consider tracking names in a std::unordered_set (or similar) to avoid repeated linear scans when more columns are involved.
  • The test class TestPrestoNativeIcebergRowLineage repeats logic to build expectedPairs from FileScanTask/DataFile in multiple tests; consider extracting this into a helper method to reduce duplication and keep the assertions focused on behavior differences.
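
One way to act on the first point is to keep a hash set of names next to the vector, so each membership check is O(1) on average instead of a linear scan. A minimal sketch, assuming the column names are plain strings; appendMissingColumns is a hypothetical helper and not the signature used in toIcebergTableHandle:

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Appends each requested name that is not already in `names`,
// preserving the existing order and the order of `requested`.
void appendMissingColumns(
    std::vector<std::string>& names,
    const std::vector<std::string>& requested) {
  // Seed the set once; this replaces a std::find over `names` per lookup.
  std::unordered_set<std::string> seen(names.begin(), names.end());
  for (const auto& name : requested) {
    // insert() returns {iterator, inserted}; inserted is false for duplicates.
    if (seen.insert(name).second) {
      names.push_back(name);
    }
  }
  // The set holds one entry per distinct name, so it can never be larger.
  assert(seen.size() <= names.size());
}
```

With only a handful of columns the linear scan is unlikely to matter in practice; the set mainly helps for wide tables.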
Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location path="presto-native-execution/pom.xml" line_range="249-251" />
<code_context>
             </exclusions>
         </dependency>

+        <dependency>
+            <groupId>org.apache.iceberg</groupId>
+            <artifactId>iceberg-core</artifactId>
+            <version>${dep.iceberg.version}</version>
+            <scope>test</scope>
</code_context>
<issue_to_address>
**issue (bug_risk):** Consider excluding `iceberg-bundled-guava` from `iceberg-core` as well to avoid guava conflicts

`iceberg-core` also brings in `iceberg-bundled-guava`, so leaving it unexcluded can cause classpath conflicts with the rest of the stack. Please apply the same exclusion here so tests use the shared Guava version instead of the shaded one.
</issue_to_address>

### Comment 2
<location path="presto-native-execution/src/test/java/com/facebook/presto/nativeworker/TestPrestoNativeIcebergRowLineage.java" line_range="248-257" />
<code_context>
+            // Per the Iceberg spec, _row_id is null for pre-V3 snapshots (first_row_id unset),
</code_context>
<issue_to_address>
**suggestion (testing):** Strengthen pre-upgrade V2 assertions by checking `_last_updated_sequence_number` against Iceberg metadata

In `testV2TableRowLineageNullThenBackfilledAfterV3Upgrade`, the pre-upgrade checks cover `_row_id` being NULL and `_last_updated_sequence_number` being non-null, but they don’t validate that these sequence values match the Iceberg `dataSequenceNumber` per data file. To make this a more robust end-to-end test, please also capture the pre-upgrade `dataSequenceNumber` for each file and assert that the values returned by Presto for `_last_updated_sequence_number` align with those, ensuring the native engine correctly inherits sequence numbers from split metadata when `firstRowId` is unset.

Suggested implementation:

```java
            // Per the Iceberg spec, _row_id is null for pre-V3 snapshots (first_row_id unset),
            // but _last_updated_sequence_number is always inherited from the data sequence number.
            assertQuery("SELECT \"_row_id\", \"_last_updated_sequence_number\", * FROM " + tableName);
            assertEquals(computeActual("SELECT \"_row_id\", * FROM " + tableName).getRowCount(), 3);
            assertQuery("SELECT \"_row_id\" FROM " + tableName + " ORDER BY id", "VALUES NULL, NULL, NULL");
            assertEquals(
                    computeActual("SELECT count(*) FROM " + tableName +
                            " WHERE \"_last_updated_sequence_number\" IS NULL").getOnlyValue(),
                    0L,
                    "_last_updated_sequence_number must be non-null for V2 rows (inherited from data sequence number)");

            // Strengthen the pre-upgrade V2 assertions by verifying that the values returned by Presto for
            // _last_updated_sequence_number match the Iceberg dataSequenceNumber values for the underlying data files.
            //
            // In this test:
            //  - The first write produces a file containing rows with id = 1, 2
            //  - The second write produces a file containing the row with id = 3
            //
            // For V2 tables, the dataSequenceNumber is stored per data file and should be inherited by all rows
            // read from that file when firstRowId is unset (pre-V3 snapshots).
            Snapshot snapshot = table.currentSnapshot();
            List<Long> dataSequenceNumbers = Streams.stream(SnapshotUtil.snapshotDataFiles(table, snapshot.snapshotId()))
                    .map(DataFile::dataSequenceNumber)
                    .distinct()
                    .sorted()
                    .collect(toImmutableList());

            assertEquals(
                    dataSequenceNumbers.size(),
                    2,
                    "Expected two distinct data sequence numbers for the two data files written in the test");

            MaterializedResult v2Rows = computeActual(
                    "SELECT id, \"_last_updated_sequence_number\" FROM " + tableName + " ORDER BY id");

            List<Long> lastUpdatedSequenceNumbers = v2Rows.getMaterializedRows().stream()
                    .map(row -> ((Number) row.getField(1)).longValue())
                    .collect(toImmutableList());

            assertEquals(lastUpdatedSequenceNumbers.size(), 3, "Expected three V2 rows");

            // Rows with id = 1 and 2 come from the first file, and id = 3 from the second file. Validate that
            // the _last_updated_sequence_number exported by the native engine matches the Iceberg dataSequenceNumber
            // for the corresponding data file.
            assertEquals(
                    lastUpdatedSequenceNumbers.get(0),
                    dataSequenceNumbers.get(0),
                    "Row with id = 1 should inherit the dataSequenceNumber of the first data file");
            assertEquals(
                    lastUpdatedSequenceNumbers.get(1),
                    dataSequenceNumbers.get(0),
                    "Row with id = 2 should inherit the dataSequenceNumber of the first data file");
            assertEquals(
                    lastUpdatedSequenceNumbers.get(2),
                    dataSequenceNumbers.get(1),
                    "Row with id = 3 should inherit the dataSequenceNumber of the second data file");

```

To compile successfully, the following imports (or their equivalents, depending on what is already used in this test class) need to be added at the top of `TestPrestoNativeIcebergRowLineage.java`:

1. Import Iceberg metadata utilities and data file type:
   - `import org.apache.iceberg.Snapshot;`
   - `import org.apache.iceberg.util.SnapshotUtil;`
   - `import org.apache.iceberg.DataFile;`

2. Import streaming/collection utilities:
   - `import java.util.List;`
   - `import com.google.common.collect.Streams;`
   - `import static com.google.common.collect.ImmutableList.toImmutableList;`

If the file already uses `MaterializedResult` and `toImmutableList`, re-use the existing imports instead of duplicating them.
</issue_to_address>

Comment thread presto-native-execution/pom.xml
@Joe-Abraham
Contributor Author

@sourcery-ai review

@Joe-Abraham
Contributor Author

@sourcery-ai summary

@Joe-Abraham
Contributor Author

@sourcery-ai guide

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've left some high level feedback:

  • In IcebergPrestoToVeloxConnector::toVeloxSplit, now that the data sequence and first row id info column keys use IcebergMetadataColumn constants, consider also replacing the hardcoded "$path" key with the corresponding constant (if available) to keep these info column names centralized and less error-prone.
  • The repeated use of the literal "format-version" and string values '2'/'3' across the new and modified tests could be refactored into shared constants or helper methods to reduce duplication and make future Iceberg version-related changes easier.

@Joe-Abraham Joe-Abraham marked this pull request as ready for review May 12, 2026 07:47
@Joe-Abraham Joe-Abraham requested review from a team as code owners May 12, 2026 07:47
@prestodb-ci prestodb-ci requested review from a team and removed request for a team May 12, 2026 07:47
Labels

from:IBM PR from IBM

2 participants