-
Notifications
You must be signed in to change notification settings - Fork 305
fix(schema): check if new schema field names conflict with existing partition field names #1619
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(schema): check if new schema field names conflict with existing partition field names #1619
Conversation
3f0bafa
to
2d6731c
Compare
/// # Errors | ||
/// - Schema field name conflicts with existing partition field name that doesn't correspond to any current schema field. | ||
fn check_schema_partition_name_conflicts(&self, schema: &Schema) -> Result<()> { | ||
let existing_partition_field_names: HashSet<&str> = self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be too wasteful to build a hash set for it. Schema
already contains mapping from field name to field ids, we could go through parttion field names and lookup it in schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I will adapt this approach 👍
.flat_map(|spec| spec.fields().iter().map(|field| field.name.as_str())) | ||
.collect(); | ||
|
||
let current_schema_field_names: HashSet<&str> = self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is incorrect, nested schema have more fields than first struct layer. You could use schema.field_id_to_name_map().values
to go through all names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also do this check when adding spec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was asking myself the same question, but there is already a validation handled by PartitionSpecBuilder.build()
. I added a proper testing case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This builder only validates current schema. Due to the multi version property of iceberg, I think we should validate against all schemas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great point.
We could introduce errors with the previous schema version 👍
Actually, for:
- Historical Schema Compatibility
- Time-Travel Queries
- Schema Evolution
I’ll add a unit test to cover this and rework it.
2d6731c
to
07c6883
Compare
.any(|partition_field| &partition_field.name == field_name) | ||
}); | ||
|
||
if conflicts_with_partition_field && current_schema.field_by_name(field_name).is_none() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we need to check current schema? Due to the multi version property of iceberg, I don't think we need to check if it's in current schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This builder only validates current schema. Due to the multi version property of iceberg, I think we should validate against all schemas?
49dcf84
to
1247b9a
Compare
Thank you for your help @liurenjie1024. Following your suggestions, I added a check for potential conflicts across different schemas. |
…ield names across all the table schemas - Add a validation step in add_schema() method to prevent conflicts - Add a validation step in add_partition_spec() method to prevent conflicts
1247b9a
to
b1293b3
Compare
Hi, @fvaleye I'm fine with small refactoring as long as it's not public api. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @fvaleye for this pr!
…artition field names (apache#1619)
Which issue does this PR close?
What changes are included in this PR?
Previously, the Iceberg Rust implementation only validated partition field names against schema field names when creating partition specs, but did not validate the reverse direction during schema evolution. This meant users could add schema fields that conflicted with existing partition field names, which could cause confusion and errors.
Are these changes tested?
Yes, with unit tests