Catalog migration CLI #2
Conversation
Very nice and helpful tool overall 👍
Force-pushed from b4ccf74 to e59ee4a.
Enforced with
Mainly this comment is pending. I replied to or handled most of the comments. I am figuring it out as I don't have
You can inject
Awesome, it worked. I am still a noob with picocli.
@dimas-b: Please go through the changes once again. If there are any comments, I will push fixes incrementally; I will not squash until you have reviewed the current state. Thanks for all the comments.
Helping users migrate from one Iceberg catalog backend to another in a user-friendly way is great, but frankly there's a lot of work needed to achieve this goal. My biggest concern is the latter point: user friendliness. To achieve that, the tool must really:
- help users identify failures and clearly flag them
- provide a way to resume a partially failed migration
- ensure that migrations of even 1000s of tables work fine and can be safely resumed even with intermittent failures
- ensure that tables never appear in both the source and the target catalog (if the source catalog supports that), and account for this in the above resume/recovery mechanism
- provide regular feedback about what the tool is exactly doing, using meaningful but not overloaded visual console aids
All of the above is mandatory IMO, because the audience of this tool is end users, who need very clear error messages with instructions on how to continue/resume/recover.
The code probably does what it should if everything works fine and neither catalog complains / none of the remote requests times out or fails for other reasons. The tool is intended to help people migrate a potentially large (1000s) set of tables from one catalog to another, but it leaves users completely on their own to recover the whole process from a single, potentially intermittent, failure. IMO this tool should focus on helping the user navigate through the whole process, especially covering failure and resume/recovery scenarios, which requires state and a lot of (abstracted) code and tests to validate it. Otherwise it's probably sufficient to just generate a piece of source code that people can paste into a Spark SQL shell or jshell or whatever.
The current options are: all tables from the source catalog, or a comma-separated list. I would not expect users to be happy when they have to copy-paste a command-line argument of 100s or 1000s of table identifiers. This will also exceed OS argument-length limits.
More notes:
The repository has neither CI nor release workflows, nor is renovatebot configured. A SECURITY.md should really be added to the repository.
If there's intent to integrate this into NesQuEIT, then it is mandatory to use Gradle and not Maven.
I do not see any test classes in the code base, which is IMO concerning (#3 isn't really a functional test).
Why hasn't this been proposed to the Apache Iceberg project?
This is a generic Iceberg tool rather than a Nessie-specific tool. Testing migrations from "any to any" catalog requires a lot of integration-testing effort. Not sure if it's just me, but the code reminds me of work done quite a while ago, and this one, which looked very, very similar.
@snazy: Thanks for the review and comments. Please find my reply inline.
This has already been discussed in the Iceberg community, and the community doesn't want this feature because it requires educating users not to use the tool while there is an in-progress commit, and because there is a risk of users operating the same table from multiple catalogs. The tool is very useful but also comes with warnings. Considering its usefulness, we still want to have it, hence keeping it under the Nessie repository.
Manually verified. The plan was to add more test cases in a follow-up. I will add the functional test case, but that requires implementing a new test catalog or mocking some existing catalogs. The effort for adding an IT is pretty big too.
I don't see any specific reason to integrate with NesQuEIT, as this tool is independent of query engines, and NesQuEIT also doesn't provide test cases for multiple catalogs like Glue, JDBC, REST, etc. It might need some framework that supports all catalogs.
The plan was to add this after the test cases. I will work on it.
Ok. I can add something like this (https://github.com/projectnessie/cel-java/blob/main/SECURITY.md)
I believe it is just a metadata operation, and already-migrated tables are skipped if the table already exists in the target (it won't abort the current migration), roughly as in the sketch below. I will benchmark a 10,000-table migration with catalogs like Arctic/Glue and then decide whether resume/recovery is needed.
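A rough sketch of that skip-if-exists behavior using the standard Iceberg catalog API; the helper name and wiring are illustrative, not the PR's actual code:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

class RegisterSketch {
  // Register a single table in the target catalog, skipping it if it already exists,
  // so a re-run after a partial failure does not abort the whole migration.
  static void registerIfAbsent(
      Catalog targetCatalog, TableIdentifier identifier, String metadataFileLocation) {
    if (targetCatalog.tableExists(identifier)) {
      // Already migrated in a previous run; nothing to do.
      return;
    }
    // Pure metadata operation: the target catalog just records the existing metadata file.
    Table table = targetCatalog.registerTable(identifier, metadataFileLocation);
    System.out.println("Registered " + table.name());
  }
}
```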
The identifiers option is meant to be used with a handful of tables. I can provide a config file to input the identifiers if we are concerned that users will use it with many tables; parsing such a file could look like the sketch below.
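A minimal sketch of reading identifiers from such a config file, one identifier per line; the file format is an assumption, not something this PR implements yet:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.iceberg.catalog.TableIdentifier;

class IdentifierFileReader {
  // Parse one table identifier per line (e.g. "db.table"), skipping blank lines,
  // so users don't have to pass hundreds of identifiers as a single CLI argument.
  static List<TableIdentifier> readIdentifiers(Path file) throws IOException {
    return Files.readAllLines(file).stream()
        .map(String::trim)
        .filter(line -> !line.isEmpty())
        .map(TableIdentifier::parse)
        .collect(Collectors.toList());
  }
}
```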
Force-pushed from 46d4582 to 924de42.
- execute `gradle init`; updated the headers (pre-requisite for #2)
Squashing the commits for an easy rebase.
Force-pushed from 47f9928 to 52b6ea3.
Follow-up/pending:
@snazy: Also updated the README document with the catalog configuration example and runtime dependencies. Also, just added the MySQL JDBC driver for the JDBC catalog. If the user needs a different driver based on the JDBC catalog's backend, they might have to supply it on the classpath, as it won't be available in the executable jar.
@snazy: I rechecked the namespace creation and listing logic. Test cases for all of these (default namespace, nested namespace) are present in each catalog's catalog migrator test.
There are still issues in the code: duplicated code, overly complex code.
@snazy: I have addressed the comments. Please check.
> I have addressed the comments.
There are still a lot of unaddressed comments.
README.md (outdated):

Note: Options for migrate command is exactly same as register command.

> :warning: **It is recommended to use this CLI tool when there is no in-progress commits for the tables in the source catalog.**
This is too weak.
Improved the sentence by mentioning the consequences, like data loss.
Really? Nothing changed here.
Previously this statement was not part of the warning; I added it at the end: "Which can lead to missing updates, loss of data and table corruption."
I will update this with a new, stronger sentence from ChatGPT:

> Avoid using this CLI tool when there are in-progress commits for tables in the source catalog to prevent missing updates, data loss, and table corruption in the target catalog. In-progress commits may not be properly transferred and could compromise the integrity of your data.
I think you checked only the latest commits. These were addressed before that commit; 678d5ca is the commit which fixed it.
README.md (outdated):

--source-catalog-properties warehouse=/tmp/warehouse,type=hadoop \
--target-catalog-type NESSIE \
--target-catalog-properties uri=https://nessie.test1.dremio.site/v1/repositories/8158e68a-5046-42c6-a7e4-c920d9ae2475,ref=main,warehouse=/tmp/warehouse,authentication.type=BEARER,authentication.token=$PAT \
--target-catalog-hadoop-conf fs.s3a.secret.key=$SECRETKEY,fs.s3a.access.key=$ACCESSKEY
Don't these options support passing the keys via environment variables? Definitely mention that passing secrets and other credentials on the command line is NOT safe.
updated.
Really? Nothing changed here.
Removed mention of keys in the catalog options descriptions and in the README, as it works without them. I don't want to remove the catalog-hadoop-conf option itself, because I am not sure which catalogs use it. Iceberg also exposes a Hadoop conf for catalog configuration, so I had to expose it.
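To illustrate why a Hadoop-conf option exists at all: Iceberg's CatalogUtil accepts a Hadoop Configuration alongside the catalog properties when instantiating a catalog. A minimal sketch, with the catalog class, property values, and environment-variable names chosen purely for illustration:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.CatalogUtil;
import org.apache.iceberg.catalog.Catalog;

public class HadoopConfExample {
  public static void main(String[] args) {
    // Hadoop configuration entries, e.g. the kind of values passed via
    // --target-catalog-hadoop-conf; read from the environment instead of the command line.
    Configuration hadoopConf = new Configuration();
    hadoopConf.set("fs.s3a.access.key", System.getenv().getOrDefault("ACCESSKEY", "<access-key>"));
    hadoopConf.set("fs.s3a.secret.key", System.getenv().getOrDefault("SECRETKEY", "<secret-key>"));

    // Catalog properties, e.g. the kind of values passed via --target-catalog-properties.
    Map<String, String> properties = new HashMap<>();
    properties.put("warehouse", "/tmp/warehouse");

    // CatalogUtil takes the Hadoop conf alongside the catalog properties,
    // which is why a *-catalog-hadoop-conf option is exposed at all.
    Catalog catalog =
        CatalogUtil.loadCatalog(
            "org.apache.iceberg.hadoop.HadoopCatalog", "target", properties, hadoopConf);
    System.out.println("Loaded catalog: " + catalog.name());
  }
}
```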
Have to -1 this again, for reasons mentioned a lot of times in previous -1 reviews in this and other PRs.
Force-pushed from 869d1bf to 845c339.
@snazy: After going through all the comments, as far as I can tell two things are pending.
Not sure I understand the last comment. Is this PR now reviewable, or are there more changes coming?
I am checking these two things. I will ping you once they are addressed, so the PR can be reviewed at once.
The PR is ready for review. I have handled build publishing too now (locally tested). I have also replied to every unresolved comment, so if you feel anything is still unresolved, please comment back on the same thread for further discussion.
Suddenly, a lot of new stuff appeared in this PR via a force-push.
It was for addressing one of the open comments regarding "build publishing". I have reverted it now and will handle it in a follow-up PR to reduce review effort on this PR.
gradle/libs.versions.toml (outdated):

errorprone-annotations = { module = "com.google.errorprone:error_prone_annotations", version.ref = "errorprone" }
errorprone-core = { module = "com.google.errorprone:error_prone_core", version.ref = "errorprone" }
errorprone-slf4j = { module = "jp.skypencil.errorprone.slf4j:errorprone-slf4j", version.ref = "errorproneSlf4j" }
ecs-bundle = { module = "com.emc.ecs:object-client-bundle", version.ref = "ecs" }
Commercially licensed, NOT ASF2 licensed.
Remove it from this PR.
I do not wanna go down the rabbit hole and discuss this issue in this PR.
Apache Iceberg (Apache 2.0 license) also depends on this at compile time for the iceberg-ecs module: https://github.com/apache/iceberg/blob/b756a5da3559be9deee9cdc674c4bf36ac291231/build.gradle#L820. So I thought it was compatible.
I will remove it. If the user gets a ClassNotFoundException while migrating to the ECS catalog, they might have to add this jar to the classpath.
gradle/libs.versions.toml (outdated):

junit-jupiter-params = { module = "org.junit.jupiter:junit-jupiter-params" }
logback-classic = { module = "ch.qos.logback:logback-classic", version.ref = "logback" }
logcaptor = { module = "io.github.hakky54:logcaptor", version.ref = "logcaptor" }
mysql-driver = { module = "mysql:mysql-connector-java", version.ref = "mysqlDriver" }
GPL2 w/ exceptions licensed, NOT ASF2 licensed.
Remove it from this PR.
I do not wanna go down the rabbit hole and discuss this issue in this PR.
Removed. The user can add this to the classpath during migration to/from a JDBC catalog (backed by MySQL).
 */
public Set<TableIdentifier> getMatchingTableIdentifiers(String identifierRegex) {
  Catalog sourceCatalog = sourceCatalog();
  if (!(sourceCatalog instanceof SupportsNamespaces)) {
All catalogs can list tables, no? See org.apache.iceberg.catalog.Catalog#listTables.
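For context, a minimal sketch (not the PR's code) of why SupportsNamespaces still matters even though Catalog#listTables exists: listTables needs a namespace, and discovering the namespaces to iterate over requires SupportsNamespaces:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.Namespace;
import org.apache.iceberg.catalog.SupportsNamespaces;
import org.apache.iceberg.catalog.TableIdentifier;

class TableListing {
  // Collect table identifiers from a source catalog. listNamespaces() returns only
  // top-level namespaces; nested namespaces would need recursion, omitted here.
  static List<TableIdentifier> listAllTables(Catalog catalog) {
    List<TableIdentifier> result = new ArrayList<>();
    if (catalog instanceof SupportsNamespaces) {
      for (Namespace namespace : ((SupportsNamespaces) catalog).listNamespaces()) {
        result.addAll(catalog.listTables(namespace));
      }
    } else {
      // Without SupportsNamespaces we can only try the empty namespace,
      // which some catalogs may reject.
      result.addAll(catalog.listTables(Namespace.empty()));
    }
    return result;
  }
}
```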
  return sourceCatalog.listTables(namespace).stream()
      .filter(matchedIdentifiersPredicate);
} catch (Exception exception) {
  if (namespace.isEmpty()) {
Why IllegalArgumentException?
README.md (outdated):

Note: Options for migrate command is exactly same as register command.

> :warning: **It is recommended to use this CLI tool when there is no in-progress commits for the tables in the source catalog.**
Really? Nothing changed here.
Some requested changes not tackled + licensing issues.
Fixed these comments.
}

private static void writeToFile(Path filePath, Collection<TableIdentifier> identifiers) {
  if (identifiers.isEmpty()) {
When we discussed this tool in detail, we concluded that the purpose of the "failed" and "successful" table-identifier list is to allow retries with the failed tables.
This condition however likely makes that scenario impossible and leads to ambiguous results - the same table identifier appearing in the files for successful and failed migrations.
True. Removed now.
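A minimal sketch of the method without the isEmpty() guard, so a retry run always overwrites both result files and a table identifier cannot linger in a stale list; only the signature comes from the excerpt above, the body is illustrative:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collection;
import java.util.stream.Collectors;

import org.apache.iceberg.catalog.TableIdentifier;

class ResultFiles {
  // Always (re)write the file, even when the collection is empty, so that the
  // "successful" and "failed" lists of a previous run can never both contain
  // the same table identifier after a retry.
  private static void writeToFile(Path filePath, Collection<TableIdentifier> identifiers) {
    String content =
        identifiers.stream().map(TableIdentifier::toString).collect(Collectors.joining("\n"));
    try {
      Files.write(filePath, content.getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException("Failed to write " + filePath, e);
    }
  }
}
```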
        targetCatalog().name()));
}

if (!(sourceCatalog() instanceof SupportsNamespaces)) {
I am surprised that this condition now appears here. Does this mean that there are no catalogs that do not implement SupportsNamespaces, and all the work and back-and-forth was just wasted effort?
All the Iceberg catalogs that are supported by this tool (maybe excluding custom catalogs) implement SupportsNamespaces. There could be a custom catalog without SupportsNamespaces, but I don't want to drag out this discussion based on an imaginary catalog. If such a catalog exists in the future, the user can raise a requirement and we can support it.
// some catalogs don't support default namespace.
// Hence, just log the warning and ignore the exception.
LOG.warn(
    "Failed to identify tables from default namespace: {}",
`default` is wrong here, same for the comment.
Changed to `empty` namespace.
 * @param identifiers collection of table identifiers to register or migrate
 * @return {@code this} for use in a chained invocation
 */
public CatalogMigrator registerTables(Collection<TableIdentifier> identifiers) {
Only used for tests, should go away here.
removed
* If the migration is successful but deletion of some tables from the source catalog failed, the summary will mention that these table names were written into the `failed_to_delete.txt` file, and the logs will capture the failure reason. Do not operate on these tables from the source catalog; the user will have to delete them manually.

### B.4) Executes the migration of 1000 tables, but manually aborts the migration by killing the process.
When the user kills the process or another hard failure occurs, the "result files" will not be populated and the user "left in the dark". I do NOT want that to be fixed in this PR - but it HAS TO BE fixed.
Created #8 for tracking
Addressed today's comments.
CLI developed based on picocli. Please refer to README.md for more details.
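For readers unfamiliar with picocli, a minimal sketch of the pattern such a CLI is built on; the command and option names below are illustrative, not necessarily the ones this tool defines:

```java
import java.util.concurrent.Callable;

import picocli.CommandLine;
import picocli.CommandLine.Command;
import picocli.CommandLine.Option;

// Hypothetical, trimmed-down picocli command: annotations declare the command and
// its options, and CommandLine.execute() parses args and invokes call().
@Command(name = "register", description = "Registers tables from one catalog into another.")
class RegisterCommand implements Callable<Integer> {

  @Option(names = "--source-catalog-type", required = true, description = "Source catalog type")
  String sourceCatalogType;

  @Option(names = "--dry-run", description = "Only list what would be registered")
  boolean dryRun;

  @Override
  public Integer call() {
    System.out.printf("source=%s dryRun=%s%n", sourceCatalogType, dryRun);
    return 0; // used as the process exit code
  }

  public static void main(String[] args) {
    System.exit(new CommandLine(new RegisterCommand()).execute(args));
  }
}
```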