Spec: Adds Row Lineage #11130

RussellSpitzer · 2024-09-13T22:26:53Z

Proposal: https://docs.google.com/document/d/146YuAnU17prnIhyuvbCtCtVSavyd5N7hKryyVRaFDTE/edit#heading=h.f2e8ffw3fu7n

Adds row lineage to the spec.

The end goal is to provide two fields for all rows:

* `_row_id`: a unique long that identifies every row added to the table
* `_last_update`: the sequence number of the last commit that modified the row

Fixes #11129

RussellSpitzer · 2024-09-13T22:27:28Z

format/spec.md

+
+#### Datafile Propagation
+
+New data files added when `row-lineage` is enabled do not require any modification. The columns for `_row_identifier`


As recommended in the proposal, we don't actually need to include the columns in new files.
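
A minimal sketch of how a reader might fill in the lineage columns when a new file does not store them. `DataFileEntry`, `first_row_id`, and `resolve_lineage` are hypothetical names used only for illustration, not text from this PR. The sketch assumes each file's metadata carries a starting ID and the sequence number of the commit that added the file, and that a stored value, when present, takes precedence over the inherited one.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class DataFileEntry:
    """Hypothetical per-file metadata; field names are assumptions."""
    path: str
    first_row_id: int      # assumed: ID assigned to the file's first row
    sequence_number: int   # sequence number of the commit that added the file
    record_count: int


def resolve_lineage(
    entry: DataFileEntry,
    materialized: Optional[Sequence[Tuple[Optional[int], Optional[int]]]] = None,
) -> List[Tuple[int, int]]:
    """Return (_row_id, _last_update) for every row position in the file.

    A missing (None) stored value is inherited from the file metadata:
    the row ID from first_row_id + position, the last update from the
    file's sequence number.
    """
    resolved = []
    for pos in range(entry.record_count):
        stored_id, stored_seq = materialized[pos] if materialized else (None, None)
        row_id = stored_id if stored_id is not None else entry.first_row_id + pos
        last_update = stored_seq if stored_seq is not None else entry.sequence_number
        resolved.append((row_id, last_update))
    return resolved
```

Under these assumptions a freshly written file needs no lineage columns at all, while a rewritten file can store the values it carried over and leave the rest null.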

RussellSpitzer · 2024-09-13T22:28:51Z

format/spec.md

+requirement is controlled by setting the field `row-lineage` to true in the table's metadata. When true, two additional 
+fields in data files will be available for all rows added to the table.
+
+* `_row_identifier` a unique long for every row. Computed via inheritance for rows in their original datafiles 


One possible alternative from the doc was to make this a combination of a random prefix and an integer, which would remove the requirement for a monotonic integer in the metadata. Since we already have other monotonic integers in the metadata, I don't think this would help much unless we make a broader change.

stevenzwu · 2024-09-18T20:28:18Z

I think I understand the row lineage concept now. The main goal is to keep the row identifier the same across file rewrites (compaction, sorting, etc.), because those operations don't insert new rows.

With row-level updates on a primary key (identifier fields), each update (with a new row value) would generate a new row identifier.

If my understanding is correct, this choice of monotonic integer is a bit like a row-level sequence number.

Technically, a UUID could also work for that purpose: populate `_row_identifier` with a UUID if it is null (during the initial insert). But I guess a 64-bit long is a shorter, more compact identifier, and the described inheritance also makes computing and populating the row identifier cheap.

RussellSpitzer · Sep 19, 2024

The problem with a UUID alone is that we can't track row origins. We would also need to use some bits to identify the origin snapshot/sequence ID of the row, which would mean either using two columns or some custom representation. The current approach uses sequence numbers and can be coupled with the "first_***" columns to determine which snapshot the row was added in.
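
The rewrite behavior discussed in this thread can be sketched as a toy model. Everything here is an assumption made for illustration: the `Table` class, the `next_row_id` counter, and the in-memory file lists are hypothetical and are not the spec's metadata layout. The sketch shows new rows taking IDs from a monotonic counter, and a compaction carrying each row's identifier and last-update value over unchanged.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple

# (_row_id, _last_update, payload)
Row = Tuple[int, int, str]


@dataclass
class Table:
    """Toy in-memory table; all names are hypothetical."""
    next_row_id: int = 0       # assumed monotonic counter kept in table metadata
    sequence_number: int = 0   # sequence number of the latest commit
    files: List[List[Row]] = field(default_factory=list)

    def append(self, payloads: Sequence[str]) -> None:
        """Commit a new file; its rows take a fresh, contiguous range of IDs."""
        self.sequence_number += 1
        start = self.next_row_id
        rows = [(start + i, self.sequence_number, p) for i, p in enumerate(payloads)]
        self.next_row_id += len(rows)
        self.files.append(rows)

    def compact(self) -> None:
        """Rewrite all files into one, sorted by payload.

        The commit gets a new sequence number, but no rows are inserted or
        modified, so every row keeps its _row_id and _last_update.
        """
        self.sequence_number += 1
        merged = sorted((r for f in self.files for r in f), key=lambda r: r[2])
        self.files = [merged]
```

Because each row keeps the sequence number of the commit that last touched it, its origin stays traceable after the rewrite, which a bare UUID would not provide on its own.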

dyfrgi · 2024-09-16T16:54:28Z

Is there a path for upgrading an existing Iceberg table to use row-lineage?

RussellSpitzer · 2024-09-16T17:16:49Z

> Is there a path for upgrading an existing Iceberg table to use row-lineage?

Turning on row-lineage would start tracking for all rows added after that point. I'm not sure we have a way of going back and adding history for previously existing rows. We could, if we like, specify that existing rows should be treated as if they were created in the manifest in which they appear, but that sounds a bit complicated.

Spec: Adds Row Lineage

3cdeb9a

github-actions bot added the Specification label (issues that may introduce spec changes) on Sep 13, 2024

Change Reserved Field Ids

8b9ff29

stevenzwu reviewed Sep 19, 2024

Reviewer Comments, column renames

69efedc
