Spec: Adds Row Lineage #11130

Open

@RussellSpitzer RussellSpitzer commented Sep 13, 2024

Proposal here:

https://docs.google.com/document/d/146YuAnU17prnIhyuvbCtCtVSavyd5N7hKryyVRaFDTE/edit#heading=h.f2e8ffw3fu7n

Adds Row Lineage to the spec.

The end goal is to provide two fields on all rows:

* `_row_id`: a unique long that identifies every row added to the table
* `_last_update`: the sequence number of the last commit to modify the row
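
As a rough illustration of how the two fields are meant to behave, here is a minimal Python sketch. The `Row` class and the `modify` helper are hypothetical names made up for this example, and it assumes an engine that carries the id through a modification; it is not taken from the spec or from any Iceberg API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Row:
    value: str
    _row_id: int       # unique long assigned when the row is first added
    _last_update: int  # sequence number of the last commit that modified the row


def modify(row: Row, new_value: str, commit_sequence_number: int) -> Row:
    # Hypothetical helper: the id is carried over unchanged,
    # only _last_update moves to the modifying commit's sequence number.
    return replace(row, value=new_value, _last_update=commit_sequence_number)


added = Row(value="a", _row_id=1000, _last_update=5)
changed = modify(added, "b", commit_sequence_number=9)
```

Here `changed` keeps the same `_row_id` as `added`, while its `_last_update` reflects the later commit.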

Fixes #11129

@github-actions github-actions bot added the Specification Issues that may introduce spec changes. label Sep 13, 2024

#### Datafile Propagation

New data files added when `row-lineage` is enabled do not require any modification. The columns for `_row_identifier`
Member Author

As recommended in the proposal, we don't actually need to include the columns in new files.

format/spec.md Outdated
requirement is controlled by setting the field `row-lineage` to true in the table's metadata. When true, two additional
fields in data files will be available for all rows added to the table.

* `_row_identifier` a unique long for every row. Computed via inheritance for rows in their original datafiles
Member Author

One possible alternative from the doc was to make this a combination of a random prefix and an integer, which would remove the requirement for a monotonic integer in the metadata. Since we already have other monotonic integers in the metadata, I think this may not be that helpful unless we make a broad change.

Contributor

I think I understand the row lineage concept now. The main goal is to keep the row identifier the same across file rewrites (compaction, sorting, etc.), because those operations don't insert new rows.

With row-level updates on a primary key (identifier fields), each update (with a new row value) would generate a new row identifier.

If my understanding is correct, this choice of a monotonic integer is a bit like a row-level sequence number.

Technically, a UUID could also work for that purpose: populate the `_row_identifier` with a UUID if it is null (during the initial insert). But I guess a 64-bit long is a shorter, more compact identifier, and the described inheritance also makes computing and populating the row identifier cheap.
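
To make the inheritance idea concrete, here is a minimal Python sketch of one way it could work. It assumes each new data file is handed a starting id from a table-wide counter and that a row with no stored id inherits `starting id + position in file`. The names `first_row_id`, `stored_ids`, `append_file`, and `compact` are hypothetical and chosen for this example; the excerpts quoted in this thread do not define the exact mechanism.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DataFile:
    # Hypothetical per-file starting id, taken from a table-wide counter at commit time.
    first_row_id: int
    # Stored _row_identifier values; None means "not written, inherit it".
    stored_ids: List[Optional[int]]


def effective_row_ids(data_file: DataFile) -> List[int]:
    # A stored id wins; otherwise the id is inherited from the file's
    # starting id plus the row's position within the file.
    return [
        stored if stored is not None else data_file.first_row_id + position
        for position, stored in enumerate(data_file.stored_ids)
    ]


def append_file(next_row_id: int, num_rows: int) -> Tuple[DataFile, int]:
    # New files store no ids at all; the counter simply advances by the row count.
    new_file = DataFile(first_row_id=next_row_id, stored_ids=[None] * num_rows)
    return new_file, next_row_id + num_rows


def compact(files: List[DataFile]) -> DataFile:
    # A rewrite materializes the effective ids so they survive the move.
    materialized: List[Optional[int]] = [
        rid for f in files for rid in effective_row_ids(f)
    ]
    # Every id is stored explicitly, so first_row_id is never consulted here.
    return DataFile(first_row_id=-1, stored_ids=materialized)


file_a, counter = append_file(0, 3)
file_b, counter = append_file(counter, 2)
merged = compact([file_a, file_b])
```

Under these assumptions, appending costs nothing per row (only the counter moves), and the compacted file reports the same ids its source files did.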

Member Author

The problem with a UUID alone is that we can't track row origins. We would also need to use some bits to identify the origin snapshot/sequence id of the row, which would mean either using two columns or some custom representation. The current approach is based on sequence numbers and can be coupled with the "first_***" columns to determine which snapshot the row was added in.

dyfrgi commented Sep 16, 2024

Is there a path for upgrading an existing Iceberg table to use row-lineage?

RussellSpitzer commented Sep 16, 2024

> Is there a path for upgrading an existing Iceberg table to use row-lineage?

Turning on row-lineage would start tracking for all rows added after that point. I'm not sure we have a way of going back and adding history for previously existing rows. We could, if we like, specify that existing rows should be treated as if they were created in the manifest in which they appear, but that sounds a bit complicated.
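
A small Python sketch of the "tracking starts at enablement" behavior, under stated assumptions: the table remembers the sequence number at which lineage was turned on (a hypothetical `lineage_enabled_at` value), and the enabling commit itself adds no data. Neither the function nor its parameters come from the spec.

```python
from typing import Optional


def has_lineage(file_sequence_number: int, lineage_enabled_at: Optional[int]) -> bool:
    # Hypothetical check: rows are tracked only if their file was added
    # strictly after the commit that enabled row-lineage.
    if lineage_enabled_at is None:
        return False  # lineage was never enabled on this table
    return file_sequence_number > lineage_enabled_at


old_file_tracked = has_lineage(file_sequence_number=3, lineage_enabled_at=5)
new_file_tracked = has_lineage(file_sequence_number=6, lineage_enabled_at=5)
```

Rows in the older file simply have no lineage values; nothing is backfilled.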

Successfully merging this pull request may close these issues.

Row Lineage for V3