@fvaleye fvaleye commented Aug 20, 2025

Which issue does this PR close?

What changes are included in this PR?

Previously, the Iceberg Rust implementation only validated partition field names against schema field names when creating partition specs, but did not validate the reverse direction during schema evolution. This meant users could add schema fields that conflicted with existing partition field names, which could cause confusion and errors.
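To make the reverse-direction check concrete, here is a minimal sketch using only the standard library. It follows the rule stated in the doc comment of the first revision of this PR (a new schema field name may not reuse a partition field name unless that name already exists in the current schema). The function name `check_new_schema_fields` and the plain string slices are hypothetical stand-ins, not the iceberg-rust API.

```rust
/// Hypothetical helper: rejects a new schema field name that collides with an
/// existing partition field name, unless the name already belongs to a field
/// of the current schema.
fn check_new_schema_fields(
    new_schema_fields: &[&str],
    current_schema_fields: &[&str],
    partition_field_names: &[&str],
) -> Result<(), String> {
    for name in new_schema_fields {
        let clashes_with_partition = partition_field_names.contains(name);
        let already_in_schema = current_schema_fields.contains(name);
        if clashes_with_partition && !already_in_schema {
            return Err(format!(
                "schema field '{name}' conflicts with an existing partition field name"
            ));
        }
    }
    Ok(())
}
```

With a partition field named `ts_day`, evolving the schema from `[id, ts]` to `[id, ts, ts_day]` would be rejected under this rule, while leaving the schema unchanged would pass.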

Are these changes tested?

Yes, with unit tests

@fvaleye fvaleye force-pushed the fix/schema-evolution-name-conflict branch 3 times, most recently from 3f0bafa to 2d6731c Compare August 20, 2025 09:23
/// # Errors
/// - Schema field name conflicts with existing partition field name that doesn't correspond to any current schema field.
fn check_schema_partition_name_conflicts(&self, schema: &Schema) -> Result<()> {
let existing_partition_field_names: HashSet<&str> = self
Contributor

Building a hash set for this may be too wasteful. Schema already contains a mapping from field names to field IDs, so we could go through the partition field names and look each one up in the schema.
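A rough sketch of the lookup-based approach, assuming a simplified schema that exposes only a name-to-ID index. The `Schema` struct, `field_id_by_name`, and `conflicting_names` here are hypothetical illustrations, not the real iceberg-rust types.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified schema: only the name -> field ID index.
struct Schema {
    name_to_id: HashMap<String, i32>,
}

impl Schema {
    /// Looks up a field ID by name in the existing index.
    fn field_id_by_name(&self, name: &str) -> Option<i32> {
        self.name_to_id.get(name).copied()
    }
}

/// Returns the partition field names that also exist as schema field names.
/// No intermediate set is built; each name is probed against the schema index.
fn conflicting_names<'a>(schema: &Schema, partition_field_names: &[&'a str]) -> Vec<&'a str> {
    partition_field_names
        .iter()
        .copied()
        .filter(|name| schema.field_id_by_name(name).is_some())
        .collect()
}
```

The cost is one hash lookup per partition field, and partition specs usually have far fewer fields than the schema does.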

Contributor Author

Yes, I will adopt this approach 👍

.flat_map(|spec| spec.fields().iter().map(|field| field.name.as_str()))
.collect();

let current_schema_field_names: HashSet<&str> = self
Contributor

This is incorrect: a nested schema has more fields than just the first struct layer. You could use schema.field_id_to_name_map().values to go through all names.
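To illustrate why top-level fields are not enough, here is a small sketch that walks a nested structure and collects every field name. The `Field` type and `all_field_names` are hypothetical, and the dotted naming of nested fields (for example `address.city`) is an assumption made for this example.

```rust
/// Hypothetical, simplified field: a name plus optional nested children.
struct Field {
    name: String,
    children: Vec<Field>,
}

/// Collects the full (dotted) name of every field, including nested ones.
fn all_field_names(fields: &[Field], prefix: &str, out: &mut Vec<String>) {
    for field in fields {
        let full = if prefix.is_empty() {
            field.name.clone()
        } else {
            format!("{prefix}.{}", field.name)
        };
        out.push(full.clone());
        // Recurse so that fields below the first struct layer are included.
        all_field_names(&field.children, &full, out);
    }
}
```

A check that only iterates the top-level struct would see `id` and `address` here but miss `address.city`.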

Contributor Author

I fixed it!

Contributor

Should we also do this check when adding a partition spec?

Contributor Author

@fvaleye fvaleye Aug 20, 2025

I was asking myself the same question, but there is already a validation handled by PartitionSpecBuilder.build(). I added a proper test case.

Contributor

This builder only validates against the current schema. Because Iceberg tables are multi-versioned, I think we should validate against all schemas.

Contributor Author

@fvaleye fvaleye Aug 21, 2025

This is a great point.
Otherwise, we could introduce errors with previous schema versions 👍

This matters for:

  • Historical Schema Compatibility
  • Time-Travel Queries
  • Schema Evolution

I’ll add a unit test to cover this and rework it.
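As a rough illustration of validating against every schema version rather than only the current one, the sketch below reduces table metadata to a map from schema ID to field names. `TableMetadata` and `schemas_conflicting_with` are hypothetical names for this example only, and the sketch ignores any exemptions the real implementation may allow.

```rust
use std::collections::HashMap;

/// Hypothetical table metadata: every schema version, keyed by schema ID,
/// reduced to its list of field names.
struct TableMetadata {
    schemas: HashMap<i32, Vec<String>>,
}

impl TableMetadata {
    /// Returns the sorted IDs of all schema versions (historical or current)
    /// that already contain a field named `partition_field_name`.
    fn schemas_conflicting_with(&self, partition_field_name: &str) -> Vec<i32> {
        let mut ids: Vec<i32> = self
            .schemas
            .iter()
            .filter(|(_, names)| names.iter().any(|n| n.as_str() == partition_field_name))
            .map(|(id, _)| *id)
            .collect();
        ids.sort_unstable();
        ids
    }
}
```

In this sketch, a partition field named `day` would be flagged if a historical schema still has a `day` column, even when the current schema has dropped it.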

@fvaleye fvaleye force-pushed the fix/schema-evolution-name-conflict branch from 2d6731c to 07c6883 Compare August 20, 2025 13:01
@fvaleye fvaleye requested a review from liurenjie1024 August 20, 2025 14:33
.any(|partition_field| &partition_field.name == field_name)
});

if conflicts_with_partition_field && current_schema.field_by_name(field_name).is_none()
Contributor

Why do we need to check the current schema? Because Iceberg tables are multi-versioned, I don't think we need to check whether the field is in the current schema.

@fvaleye fvaleye force-pushed the fix/schema-evolution-name-conflict branch from 49dcf84 to 1247b9a Compare August 21, 2025 15:01
Contributor Author

fvaleye commented Aug 21, 2025

Thank you for your help @liurenjie1024.

Following your suggestions, I added a check for potential conflicts across different schemas.
As a side note, I also extracted two helper functions to improve focus and readability.
Would love to hear your thoughts.

…ield names across all the table schemas

- Add a validation step in add_schema() method to prevent conflicts
- Add a validation step in add_partition_spec() method to prevent conflicts
@fvaleye fvaleye force-pushed the fix/schema-evolution-name-conflict branch from 1247b9a to b1293b3 Compare August 21, 2025 15:55
@liurenjie1024
Contributor

> Following your suggestions, I added a check for potential conflicts across different schemas.
> As a side note, I also extracted two helper functions to improve focus and readability.
> Would love to hear your thoughts.

Hi @fvaleye, I'm fine with small refactorings as long as they don't touch the public API.

Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @fvaleye for this PR!

@liurenjie1024 liurenjie1024 merged commit 55cc6c3 into apache:main Aug 22, 2025
17 checks passed
Yiyang-C pushed a commit to Yiyang-C/iceberg-rust that referenced this pull request Aug 26, 2025
Successfully merging this pull request may close these issues.

Partition field and schema field name conflicts not validated on schema evolution
2 participants