
Inconsistent id definition on Flink resolvedSchema conversion to Iceberg schema #11128

Open · 2 of 3 tasks
tonycox opened this issue Sep 13, 2024 · 3 comments
Labels: bug (Something isn't working)

tonycox commented Sep 13, 2024

Apache Iceberg version

1.6.1 (latest release)

Query engine

Flink

Please describe the bug 🐞

When I try to convert a Flink ResolvedSchema to an Iceberg Schema via

import org.apache.iceberg.flink.FlinkSchemaUtil
FlinkSchemaUtil.convert(tableEnv.fromDataStream(dataStream).resolvedSchema)

it returns the schema definition

table {
  0: event_time: optional timestamptz
  1: name: optional string
  2: json_map: optional map<string, string>
}

which, I suppose, is not correct.
My assumption comes from the fact that whenever I call catalog.loadTable(id).schema() it returns

table {
  1: event_time: optional timestamptz
  2: name: optional string
  3: json_map: optional map<string, string>
}

and id validation will fail if, say, I try to update the schema using the one extracted from the Flink table.

I found the lines where the ids are assigned:

for (int i = 0; i < rowType.getFieldCount(); i++) {
  int id = isRoot ? i : getNextId();
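
A possible workaround, assuming the converted schema only needs to match the catalog's numbering: reassign fresh ids after conversion with TypeUtil.assignIncreasingFreshIds, which numbers fields from 1 the same way the catalog does on table creation. A minimal sketch (variable names are illustrative):

import org.apache.iceberg.Schema;
import org.apache.iceberg.flink.FlinkSchemaUtil;
import org.apache.iceberg.types.TypeUtil;

// Convert the Flink schema, then reassign ids so they increase from 1,
// matching the ids the catalog assigns when it creates the table.
Schema converted = FlinkSchemaUtil.convert(resolvedSchema);
Schema withFreshIds = TypeUtil.assignIncreasingFreshIds(converted);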

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
tonycox added the bug (Something isn't working) label on Sep 13, 2024
pvary (Contributor) commented Sep 14, 2024

If you already have an Iceberg table, the Iceberg table is the source of truth. The other conversions exist to generate a schema for creating the Iceberg table in the first place.

Generating the same ids is not easily solved, because schema evolution would cause "skipped" ids.
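
To illustrate that flow (a sketch with a hypothetical table identifier, not code from this thread): the catalog reassigns field ids when the table is created, so the ids produced by the conversion are discarded at that point anyway.

import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;

// Hypothetical identifier, for illustration only.
TableIdentifier id = TableIdentifier.of("db", "events");
Schema converted = FlinkSchemaUtil.convert(resolvedSchema);

// createTable stores a schema with fresh, increasing ids (1, 2, 3, ...),
// so the 0-based root ids from the Flink conversion never survive.
Table table = catalog.createTable(id, converted);
// catalog.loadTable(id).schema() now reports ids starting from 1.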

tonycox (Author) commented Sep 18, 2024

@pvary
In the example the schema is the same, but in my case I wanted "implicit" schema evolution on write. Say I add an additional field to the source event; on the deployment step, once the pipeline understands that the schema has been updated, it evolves the target schema as well. Right now I'm skipping ids in schema validation everywhere, even in unit tests, since they are inconsistent all the time, and I rely only on the ordering of the fields and their presence/absence.
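
For that kind of implicit evolution, one option (not discussed above; it assumes the table already exists) is Iceberg's UpdateSchema.unionByNameWith, which merges a converted schema by field name and keeps the table's existing ids. A minimal sketch:

import org.apache.iceberg.Schema;

Schema incoming = FlinkSchemaUtil.convert(resolvedSchema);

// unionByNameWith matches fields by name, adding columns that exist in
// the incoming schema but not in the table, while preserving the
// table's field ids.
table.updateSchema()
    .unionByNameWith(incoming)
    .commit();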

pvary (Contributor) commented Sep 18, 2024

I'm facing a similar challenge. See: https://lists.apache.org/thread/vyw595d0747p33qg886b1o82mcw40523

The visitors could be used to traverse the schema, but you need to match the fields by name. This becomes problematic when column names are reused.
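
As an illustration of that name-based matching (a sketch, not code from the thread): index both schemas by full field name with TypeUtil.indexByName and diff the name sets. This is exactly what goes wrong when a column name is dropped and later reused, because the reused name silently matches the old one.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import org.apache.iceberg.types.TypeUtil;

// Maps full field names (e.g. "json_map.key") to field ids.
Map<String, Integer> tableIds = TypeUtil.indexByName(tableSchema.asStruct());
Map<String, Integer> flinkIds = TypeUtil.indexByName(flinkSchema.asStruct());

// Names present only in the Flink-derived schema look like new columns --
// unless the name was used before and dropped, in which case name-based
// matching cannot tell the difference.
Set<String> added = new HashSet<>(flinkIds.keySet());
added.removeAll(tableIds.keySet());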
