Spec: Adds Row Lineage #11130
#### Datafile Propagation

New data files added when `row-lineage` is enabled do not require any modification. The columns for `_row_identifiier`
As recommended in the proposal, we don't actually need to include the columns in new files.
format/spec.md
Outdated
requirement is controlled by setting the field `row-lineage` to true in the table's metadata. When true, two additional
fields in data files will be available for all rows added to the table.

* `_row_identifier` a unique long for every row. Computed via inheritance for rows in their original datafiles
One possible alternative from the doc was to make this a combination of a random prefix and an integer, which would remove the requirement for a monotonic integer in the metadata. Since we already have other monotonic integers in the metadata, I think this may not be that helpful unless we make a broad change.
I think I understand the row lineage concept now. The main goal is to keep the row identifier the same across file rewrites (compaction, sorting, etc.), because those operations don't insert new rows.
With row-level updates on a primary key (identifier fields), each update (with a new row value) would generate a new row identifier.
If my understanding is correct, this choice of a monotonic integer is a bit like a row-level sequence number.
Technically, a UUID could also work for that purpose: populate the `_row_identifier` with a UUID if it is null (during the initial insert). But I guess a 64-bit long is a shorter and more compact identifier, and the described inheritance also makes computing and populating the row identifier cheap.
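To make the distinction above concrete, here is a minimal Python sketch of a toy table in which a rewrite keeps row identifiers and a row-level update assigns a new one. Everything in it (`Table`, `Row`, `compact`, `update`, the in-memory counter) is hypothetical and only illustrates the behavior described in this comment; it is not the Iceberg API or the spec's mechanism.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Row:
    row_id: int  # stands in for _row_identifier
    key: str
    value: str


class Table:
    """Toy table: a list of files, each file a list of rows."""

    def __init__(self):
        # Assumption: a monotonic counter, as if tracked in table metadata.
        self._next_id = count()
        self.files = []

    def insert(self, key, value):
        # A newly inserted row always receives a fresh identifier.
        row = Row(next(self._next_id), key, value)
        self.files.append([row])
        return row

    def compact(self):
        # Rewrite: merge all files into one sorted file.
        # Rows are copied unchanged, so identifiers are preserved.
        merged = sorted((r for f in self.files for r in f), key=lambda r: r.key)
        self.files = [merged]

    def update(self, key, value):
        # Row-level update: drop the old row, write a new row with a new id.
        self.files = [[r for r in f if r.key != key] for f in self.files]
        self.files = [f for f in self.files if f]
        return self.insert(key, value)

    def ids(self):
        return {r.key: r.row_id for f in self.files for r in f}
```

In this model, calling `compact()` leaves `ids()` unchanged, while `update("a", ...)` replaces the identifier stored for `"a"` and leaves every other row alone.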
The problem with a UUID alone is that we can't track row origins. We would also need to use some bits to identify the origin snapshot or sequence ID of the row, which would mean either using two columns or some custom representation. The current approach uses sequence numbers and can be coupled with the "first_***" columns to determine which snapshot the row was added in.
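As a rough illustration of why a monotonic identifier helps here, the sketch below recovers the commit that added a row from its identifier alone. It assumes each commit records the first identifier it assigned; the `commits` list and its `(sequence_number, first_assigned_id)` layout are hypothetical stand-ins for the "first_***" columns, whose real names and shapes are not given in this thread.

```python
from bisect import bisect_right

# Hypothetical metadata, ordered by commit:
# (sequence_number, first_assigned_id)
commits = [(1, 0), (2, 100), (3, 250)]


def origin_sequence_number(row_id, commits):
    """Return the sequence number of the commit that assigned row_id."""
    starts = [first for _, first in commits]
    # Find the last commit whose first assigned id is <= row_id.
    idx = bisect_right(starts, row_id) - 1
    if idx < 0:
        raise ValueError(f"row id {row_id} precedes every known commit")
    return commits[idx][0]
```

A random UUID carries no such ordering, so the same lookup would need a second column or a custom encoding, which is the trade-off described above.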
Is there a path for upgrading an existing Iceberg table to use row-lineage?
Turning on row-lineage would start tracking for all rows added after that point; I'm not sure we have a way of going back and adding history for previously existing rows. We could, if we like, specify that existing rows should be treated as if they were created in the manifest in which they appear, but that sounds a bit complicated.
Proposal here: https://docs.google.com/document/d/146YuAnU17prnIhyuvbCtCtVSavyd5N7hKryyVRaFDTE/edit#heading=h.f2e8ffw3fu7n

Adds Row Lineage to the Spec

End goal is to provide two fields to all rows:

* `_row_id`: a unique long which identifies every row added to the table
* `_last_update`: the sequence number of the last commit to modify the row

Fixes #11129
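For readers new to the idea, here is a minimal Python sketch of how a reader might resolve the two fields, assuming "inheritance" means that a null stored value falls back to file-level metadata. The names `FileMeta`, `first_row_id`, and `resolve_lineage` are hypothetical, and computing the identifier as a base value plus the row's position is an assumption made for illustration, not something this PR description states.

```python
from typing import NamedTuple, Optional, Tuple


class FileMeta(NamedTuple):
    first_row_id: int     # hypothetical: base id assigned to the file
    sequence_number: int  # sequence number of the commit that added the file


def resolve_lineage(
    meta: FileMeta,
    position: int,
    stored_row_id: Optional[int],
    stored_last_update: Optional[int],
) -> Tuple[int, int]:
    """Return (_row_id, _last_update) for one row of a data file."""
    # Rows in their original file store nothing and inherit from metadata.
    row_id = stored_row_id if stored_row_id is not None else meta.first_row_id + position
    last_update = (
        stored_last_update if stored_last_update is not None else meta.sequence_number
    )
    return row_id, last_update
```

Under these assumptions, a row at position 3 of a new file with base id 1000 and sequence number 7 resolves to `(1003, 7)`, and a rewritten file that stores those values explicitly keeps them regardless of its own metadata.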