-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for add_files
and add_files_from_table
procedures in Iceberg
#22751
Conversation
...no-iceberg/src/test/java/io/trino/plugin/iceberg/procedure/TestIcebergAddFilesProcedure.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/procedure/AddFilesProcedure.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/procedure/AddFilesProcedure.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/procedure/HiveMigrations.java
Outdated
Show resolved
Hide resolved
checkProcedureArgument( | ||
table.schemas().size() == sourceTable.getDataColumns().size(), | ||
"Data column count mismatch: %d vs %d", table.schemas().size(), sourceTable.getDataColumns().size()); | ||
for (Column sourceColumn : sourceTable.getDataColumns()) { | ||
Types.NestedField targetColumn = schema.caseInsensitiveFindField(sourceColumn.getName()); | ||
if (targetColumn == null) { | ||
throw new TrinoException(COLUMN_NOT_FOUND, "Column '%s' does not exist".formatted(sourceColumn.getName())); | ||
} | ||
ColumnIdentity columnIdentity = createColumnIdentity(targetColumn); | ||
org.apache.iceberg.types.Type sourceColumnType = toIcebergType(typeManager.getType(getTypeSignature(sourceColumn.getType(), DEFAULT_PRECISION)), columnIdentity); | ||
if (!targetColumn.type().equals(sourceColumnType)) { | ||
throw new TrinoException(TYPE_MISMATCH, "Expected target '%s' type, but got source '%s' type".formatted(targetColumn.type(), sourceColumnType)); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These checks may be a little strict, if Iceberg supports coercion from the source column to the target type it should be okay as is.
Having additional columns in the source than the target also doesn't hurt, we'll just ignore them.
It is harder to mess up a call to the procedure this way, but I guess it's a question of how we expect people to use this. With perfectly matching schemas, or ones that have been tweaked a little.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should start with strict checks and tweak them based on user feedback.
The common use case looks adding files from Hive partitions after the initial migration. For instance, migrating the entire table on July 24, adding files from July 25 partition next day, .... until the end of migration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does Spark's version of add_files enforce strict checks, or is it permissive?
I'm also in favor of defaulting to strict mode, but giving the user the ability disable it with a non-strict
boolean flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
627e63d
to
46e9950
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/procedure/HiveMigrations.java
Outdated
Show resolved
Hide resolved
...no-iceberg/src/test/java/io/trino/plugin/iceberg/procedure/TestIcebergAddFilesProcedure.java
Outdated
Show resolved
Hide resolved
...no-iceberg/src/test/java/io/trino/plugin/iceberg/procedure/TestIcebergAddFilesProcedure.java
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/procedure/AddFilesProcedure.java
Outdated
Show resolved
Hide resolved
46e9950
to
fe148d7
Compare
What about sorted by? Is there a way to also allow sorted by columns in this way? |
fe148d7
to
ec6b98b
Compare
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/procedure/HiveMigrations.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/procedure/HiveMigrations.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergConfig.java
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/procedure/AddFilesProcedure.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/procedure/AddFilesProcedure.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/procedure/AddFilesProcedure.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/procedure/AddFilesProcedure.java
Outdated
Show resolved
Hide resolved
ec6b98b
to
4b1f362
Compare
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/procedure/AddFilesProcedure.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/procedure/AddFilesProcedure.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/procedure/AddFilesProcedure.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/procedure/AddFilesProcedure.java
Outdated
Show resolved
Hide resolved
...no-iceberg/src/test/java/io/trino/plugin/iceberg/procedure/TestIcebergAddFilesProcedure.java
Outdated
Show resolved
Hide resolved
4b1f362
to
0462ad7
Compare
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/procedure/AddFilesProcedure.java
Outdated
Show resolved
Hide resolved
@ebyhr please update description to match updated syntax |
...no-iceberg/src/test/java/io/trino/plugin/iceberg/procedure/TestIcebergAddFilesProcedure.java
Outdated
Show resolved
Hide resolved
...no-iceberg/src/test/java/io/trino/plugin/iceberg/procedure/TestIcebergAddFilesProcedure.java
Outdated
Show resolved
Hide resolved
0462ad7
to
7760d49
Compare
Let's split the function for adding files from a path and from another table into separate ones. Having overloads where most of the arguments are different is error prone and confusing for users. |
7760d49
to
b56c022
Compare
add_files
procedure in Icebergadd_files_from_location
and add_files_from_table
procedures in Iceberg
@martint Separated into |
b56c022
to
f5fda78
Compare
...no-iceberg/src/test/java/io/trino/plugin/iceberg/procedure/TestIcebergAddFilesProcedure.java
Outdated
Show resolved
Hide resolved
A couple of questions/comments:
|
f5fda78
to
9e7efc6
Compare
add_files_from_location
and add_files_from_table
procedures in Icebergadd_files
and add_files_from_table
procedures in Iceberg
9e7efc6
to
ef8dd05
Compare
Add add_files_from_table and add_files procedures in Iceberg connector. The add_files procedure is disabled by deafult because location based access conrol is not supported in Trino.
ef8dd05
to
c6cc6d1
Compare
@martint Removed |
Description
Spark supports adding files from tables or locations with add_files procedure. This procedure is helpful for migrating specific Hive partitions, or importing files on the storage. It would be nice to support the same procedure in Trino Iceberg connector.
This PR adds
add_files
andadd_files_from_table
procedures.In
add_files
procedure, `recursive_directory argument is optional:In
add_files_from_table
procedure,partition_filter
andrecursive_directory
arguments are optional:The
add_files
procedure is disabled by default withiceberg.add-files-procedure.enabled
config property because OSS Trino doesn't support location based access control.Fixes #11744
Release notes