
Inconsistent id definition on Flink resolvedSchema conversion to Iceberg schema #11128

Open · 2 of 3 tasks
tonycox opened this issue Sep 13, 2024 · 3 comments
Labels: bug (Something isn't working)

tonycox commented Sep 13, 2024

Apache Iceberg version

1.6.1 (latest release)

Query engine

Flink

Please describe the bug 🐞

When I try to convert a Flink ResolvedSchema to an Iceberg Schema via

import org.apache.iceberg.flink.FlinkSchemaUtil
FlinkSchemaUtil.convert(tableEnv.fromDataStream(dataStream).resolvedSchema)

it returns the schema definition

table {
  0: event_time: optional timestamptz
  1: name: optional string
  2: json_map: optional map<string, string>
}

which, I suppose, is not correct.
My assumption comes from the fact that whenever I call catalog.loadTable(id).schema() it returns

table {
  1: event_time: optional timestamptz
  2: name: optional string
  3: json_map: optional map<string, string>
}

and id validation will fail if, say, I try to update the schema using the one extracted from the Flink table.

I found the lines where the ids are assigned:

for (int i = 0; i < rowType.getFieldCount(); i++) {
  int id = isRoot ? i : getNextId();
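
A possible workaround, assuming the converted schema only needs to match the catalog's numbering: reassign fresh ids after conversion with TypeUtil.assignIncreasingFreshIds, which numbers fields from 1 the same way the catalog does on table creation. A minimal sketch (variable names are illustrative):

import org.apache.iceberg.Schema;
import org.apache.iceberg.flink.FlinkSchemaUtil;
import org.apache.iceberg.types.TypeUtil;

// Convert the Flink schema, then reassign ids so they increase from 1,
// matching the ids the catalog assigns when it creates the table.
Schema converted = FlinkSchemaUtil.convert(resolvedSchema);
Schema withFreshIds = TypeUtil.assignIncreasingFreshIds(converted);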

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
tonycox added the bug (Something isn't working) label on Sep 13, 2024
pvary (Contributor) commented Sep 14, 2024

If you already have an Iceberg table, the Iceberg table is the source of truth. The other conversions exist to generate a schema for creating the Iceberg table in the first place.

Generating the same ids is not easily solved, because schema evolution would cause "skipped" ids.
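
To illustrate that flow (a sketch with a hypothetical table identifier, not code from this thread): the catalog reassigns field ids when the table is created, so the ids produced by the conversion are discarded at that point anyway.

import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;

// Hypothetical identifier, for illustration only.
TableIdentifier id = TableIdentifier.of("db", "events");
Schema converted = FlinkSchemaUtil.convert(resolvedSchema);

// createTable stores a schema with fresh, increasing ids (1, 2, 3, ...),
// so the 0-based root ids from the Flink conversion never survive.
Table table = catalog.createTable(id, converted);
// catalog.loadTable(id).schema() now reports ids starting from 1.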

tonycox (Author) commented Sep 18, 2024

@pvary
In the example the schema is the same, but in my case I wanted "implicit" schema evolution on write. Say I add an additional field to the source event; on the deployment step, once the pipeline understands that the schema has been updated, it evolves the target schema as well. Right now I'm skipping ids in schema validation everywhere, even in unit tests, since they are inconsistent all the time, and I rely only on the ordering of the fields and their presence/absence.
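
For that kind of implicit evolution, one option (not discussed above; it assumes the table already exists) is Iceberg's UpdateSchema.unionByNameWith, which merges a converted schema by field name and keeps the table's existing ids. A minimal sketch:

import org.apache.iceberg.Schema;

Schema incoming = FlinkSchemaUtil.convert(resolvedSchema);

// unionByNameWith matches fields by name, adding columns that exist in
// the incoming schema but not in the table, while preserving the
// table's field ids.
table.updateSchema()
    .unionByNameWith(incoming)
    .commit();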

pvary (Contributor) commented Sep 18, 2024

I'm facing a similar challenge. See: https://lists.apache.org/thread/vyw595d0747p33qg886b1o82mcw40523

The visitors could be used to traverse the schema, but you need to match the fields by name. This becomes problematic when column names are reused.
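
As an illustration of that name-based matching (a sketch, not code from the thread): index both schemas by full field name with TypeUtil.indexByName and diff the name sets. This is exactly what goes wrong when a column name is dropped and later reused, because the reused name silently matches the old one.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import org.apache.iceberg.types.TypeUtil;

// Maps full field names (e.g. "json_map.key") to field ids.
Map<String, Integer> tableIds = TypeUtil.indexByName(tableSchema.asStruct());
Map<String, Integer> flinkIds = TypeUtil.indexByName(flinkSchema.asStruct());

// Names present only in the Flink-derived schema look like new columns --
// unless the name was used before and dropped, in which case name-based
// matching cannot tell the difference.
Set<String> added = new HashSet<>(flinkIds.keySet());
added.removeAll(tableIds.keySet());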
