
Catalog migration CLI #2

Merged
merged 31 commits on Apr 18, 2023
Conversation

@ajantha-bhat (Contributor) commented Dec 9, 2022

CLI developed based on picocli.

Please refer to README.md for more details.

@dimas-b (Member) left a comment

Very nice and helpful tool overall 👍

pom.xml (review thread resolved)
@ajantha-bhat (Contributor, Author) commented Dec 12, 2022

Enforced formatting with spotless and ran spotless:apply. I had to squash the commits as the diff was huge.
Please recheck the PR.

CommandLine.getOut() is preferable, IMHO

Mainly this comment is pending. I replied to or handled most of the comments. I am still figuring it out, as I don't have a CommandLine instance in the run method.

@dimas-b (Member) commented Dec 12, 2022

Mainly this comment is pending. I replied to or handled most of the comments. I am still figuring it out, as I don't have a CommandLine instance in the run method.

You can inject @Spec private CommandSpec spec and then call spec.commandLine().
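For reference, a minimal sketch of the injection pattern described here, assuming picocli is on the classpath (the command name and output text are illustrative, not the actual CLI class):

```java
import picocli.CommandLine.Command;
import picocli.CommandLine.Model.CommandSpec;
import picocli.CommandLine.Spec;

@Command(name = "migrate", description = "Illustrative command, not the actual CLI class")
class MigrateCommand implements Runnable {

  // picocli injects the CommandSpec of the enclosing command,
  // which gives access to the CommandLine instance inside run().
  @Spec
  private CommandSpec spec;

  @Override
  public void run() {
    // Write to the configured output stream instead of System.out,
    // so output can be captured or redirected in tests.
    spec.commandLine().getOut().println("migration started");
  }
}
```

This avoids needing a reference to the `CommandLine` object in the command class itself.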

@ajantha-bhat (Contributor, Author)

You can inject @Spec private CommandSpec spec and then call spec.commandLine().

Awesome, it worked. I am still a noob with picocli.

@ajantha-bhat (Contributor, Author)

@dimas-b: Please go through the changes once again. If there are any comments, I will push fixes incrementally; I will not squash once you have reviewed the current state. Thanks for all the comments.

@snazy (Member) left a comment

Helping users migrate from one Iceberg catalog backend to another in a user-friendly way is great, but frankly there's a lot of work needed to achieve this goal. My biggest concern is the latter point: user friendliness. To achieve that, the tool must really:

  • help users identify failures and clearly flag them
  • provide a way to resume a partially failed migration
  • ensure that migrations of even 1000s of tables work fine and can be safely resumed even with intermittent failures
  • ensure that tables never appear in both the source and the target catalog (if the source catalog supports that) - consider this in the above resume/recovery mechanism
  • provide regular feedback about what the tool is exactly doing - using meaningful but not overloaded visual console aids

All of the above is mandatory IMO, because the audience of this tool is end users, who need very clear error messages with instructions on how to continue/resume/recover.

The code probably does what it should if everything works fine and neither catalog complains / none of the remote requests times out or fails for other reasons. The tool is intended to help people migrate a potentially large (1000s) set of tables from one catalog to another, but it leaves the user completely on their own to recover the whole process from a single, potentially intermittent, failure. IMO this tool should focus on helping the user navigate through the whole process, especially covering failure and resume/recovery scenarios, which requires state and a lot of (abstracted) code and tests to validate it. Otherwise it's probably sufficient to just generate a piece of source code that people can paste into a Spark SQL shell or jshell or whatever.

The current options are: all tables from the source catalog, or a comma-separated list. I would not expect users to be happy when they have to copy-paste a command-line argument of 100s or 1000s of table identifiers. This will exceed OS limits.

More notes:

The repository has neither CI nor release workflows, nor is renovatebot configured. A SECURITY.md should really be added to the repository.

If there's intent to integrate this into NesQuEIT, then it is mandatory to use Gradle and not Maven.

I do not see any test classes in the code base, which is IMO concerning (#3 isn't really a functional test).

Why hasn't this been proposed to the Apache Iceberg project?
This is rather a generic Iceberg tool than a Nessie specific tool. Testing migrations from "any to any" catalog requires a lot of integration testing effort. Not sure if it's just me, but the code reminds me of work done quite a while ago and this one, which looked very very similar.

pom.xml (review threads resolved)
@nastra nastra self-requested a review December 14, 2022 16:31
@ajantha-bhat (Contributor, Author)

@snazy: Thanks for the review and comments. Please find my replies inline.

Why hasn't this been proposed to the Apache Iceberg project?
This is rather a generic Iceberg tool than a Nessie specific tool. Testing migrations from "any to any" catalog requires a lot of integration testing effort. Not sure if it's just me, but the code reminds me of work done quite a while ago and this one, which looked very very similar.

This has already been discussed in the Iceberg community, and the community doesn't want this feature because it requires educating users not to use this tool while there is an in-progress commit, and there is also a risk of users operating the same table from multiple catalogs. This tool is very useful but also comes with warnings.

Considering its usefulness, we still want to have this tool. Hence, we are keeping it under the Nessie repository.

I do not see any test classes in the code base, which is IMO concerning (#3 isn't really a functional test).

Manually verified. The plan was to add more test cases in a follow-up. I will add the functional test case, but that requires implementing a new test catalog or mocking some existing catalogs. The effort for adding an IT is pretty big too.

If there's intent to integrate this into NesQuEIT, then it is mandatory to use Gradle and not Maven.

I don't see any specific reason to integrate with NesQuEIT, as this tool is independent of query engines, and NesQuEIT also doesn't provide test cases for multiple catalogs like Glue, JDBC, REST, etc. It might need some framework that supports all catalogs.

The repository has neither CI nor release workflows nor is renovatebot configured.

The plan was to add this after the test cases. I will work on it.

A SECURITY.md should really be added to the repository.

Ok. I can add something like this (https://github.com/projectnessie/cel-java/blob/main/SECURITY.md)

The tool is intended to help people migrate a potentially large (1000s) set of tables from one catalog to another, but it leaves the user completely on their own to recover the whole process from a single, potentially intermittent, failure. IMO this tool should focus on helping the user navigate through the whole process, especially covering failure and resume/recovery scenarios

I believe it is just a metadata operation, and already-migrated tables will be skipped if the table already exists (it won't abort the current migration). I will benchmark a 10,000-table migration with catalogs like Arctic/Glue and then decide whether resume/recovery is needed.

The current options are: all tables from the source catalog, or a comma-separated list. I would not expect users to be happy when they have to copy-paste a command-line argument of 100s or 1000s of table identifiers. This will exceed OS limits.

The identifiers option is meant to be used with a handful of tables. I can provide a config file to input the identifiers if we are concerned that users will use it with many tables.

This was referenced Dec 30, 2022
@ajantha-bhat ajantha-bhat marked this pull request as draft January 2, 2023 06:38
ajantha-bhat added a commit that referenced this pull request Jan 2, 2023
- execute `gradle init`
- updated the headers

pre-requisite for #2
@ajantha-bhat (Contributor, Author)

Squashing the commits for easy rebase.

@ajantha-bhat ajantha-bhat marked this pull request as ready for review January 9, 2023 04:44
@ajantha-bhat (Contributor, Author)

ajantha-bhat commented Jan 9, 2023

@snazy, @dimas-b

  • moved to Gradle
  • depending on the Nessie gradle-build-plugin for spotless, release publishing, etc. Added a buildSrc module.
  • version.txt keeps the version; the Gradle and CLI versions are extracted from it
  • added SECURITY.md
  • writing the failure call stack to the logs instead of the console.
  • added a nice summary and details on the console. Also, the console will provide more info on what is currently happening.
  • added an option 'identifiers-from-file' to read the identifiers from a file, to avoid typing a big list on the console.
  • failed table identifiers are written into a file failed_identifiers.txt, which can be used to resume the operation (by retrying only the failed tables)
  • parallelized listing tables from each namespace.
  • added a main-branch CI workflow.
  • added renovatebot
  • added a warning on the console for Hadoop tables about not operating tables from two catalogs.
  • added unit test cases using two Hadoop catalogs.

Follow-up/pending:

  • release workflow automation.
  • Integration tests.
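The failed_identifiers.txt / identifiers-from-file mechanism described above can be sketched roughly as follows. This is a simplified, hypothetical sketch using plain strings instead of Iceberg's TableIdentifier; the class and method names are illustrative, not the tool's actual code:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

public class FailedIdentifiersSketch {

  // One identifier per line; skip blank lines and '#' comments.
  static List<String> parseIdentifiers(List<String> lines) {
    return lines.stream()
        .map(String::trim)
        .filter(line -> !line.isEmpty() && !line.startsWith("#"))
        .collect(Collectors.toList());
  }

  // After a failed run: persist the failed identifiers so the user can retry.
  static void writeIdentifiers(Path file, List<String> identifiers) {
    try {
      Files.write(file, identifiers);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // On retry: an identifiers-from-file option reads the list back.
  static List<String> readIdentifiers(Path file) {
    try {
      return parseIdentifiers(Files.readAllLines(file));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) throws IOException {
    Path file = Files.createTempFile("failed_identifiers", ".txt");
    writeIdentifiers(file, List.of("db1.tbl_a", "db2.tbl_b"));
    System.out.println(readIdentifiers(file)); // prints [db1.tbl_a, db2.tbl_b]
    Files.deleteIfExists(file);
  }
}
```

Reading identifiers from a file also sidesteps the OS limit on command-line argument length mentioned earlier in the review.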

@ajantha-bhat (Contributor, Author)

ajantha-bhat commented Mar 31, 2023

@snazy:
Tested now (basic scenario) with DynamoDB, JDBC (found one bug and fixed it), Glue, Hive, Hadoop, and Nessie (also Arctic).
The migration worked fine.
This covers most of the catalogs.

Also, updated the README document with the catalog configuration example.

Runtime dependencies like hadoop-aws and hive-metastore depend on hadoop-common. As I am unsure which classes from hadoop-common are used by those two runtime dependencies, I didn't exclude transitive dependencies from hadoop-common. The final jar size is 157 MB now.

Also, I just added the MySQL JDBC driver for the JDBC catalog. If a user needs a different driver based on the JDBC catalog's backend, they might have to supplement that on the classpath, as it won't be available in the executable jar.

@ajantha-bhat (Contributor, Author)

@snazy: I rechecked the namespace creation and listing logic in CatalogMigrator.
createNamespacesIfNotExistOnTargetCatalog creates the missing namespaces on the target catalog, including parent namespaces for the nested case.

getAllNamespacesFromSourceCatalog recursively lists all the namespaces from the source catalog. It also has dedup logic with a hash set, because some catalogs don't respect the namespace-level restriction in listNamespaces(namespace).

getMatchingTableIdentifiers has logic to list tables from the default namespace too, along with dedup logic (hash set), as some catalogs (like Hadoop) return all the tables when listing from the default namespace.

Test cases for all of these (default namespace, nested namespace) are present in each catalog's catalog migrator tests (ITHadoopToNessieCatalogMigrator, HadoopCatalogMigratorTest, ITNessieToHiveCatalogMigrator).
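The dedup behaviour described above can be modelled with a set that silently drops repeated entries. This is an illustrative, stdlib-only sketch over plain strings, not the actual CatalogMigrator code:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DedupSketch {

  // Collect identifiers from several listings; a Set drops the duplicates
  // that some catalogs produce (e.g. a Hadoop catalog returning every table
  // again when listing from the empty/default namespace).
  static Set<String> collectUnique(List<List<String>> listings) {
    Set<String> result = new LinkedHashSet<>(); // preserves first-seen order
    for (List<String> listing : listings) {
      result.addAll(listing);
    }
    return result;
  }

  public static void main(String[] args) {
    List<List<String>> listings =
        List.of(
            List.of("ns1.tbl_a", "ns1.tbl_b"),  // e.g. listTables(ns1)
            List.of("ns1.tbl_a", "ns2.tbl_c")); // listing elsewhere repeats ns1.tbl_a
    System.out.println(collectUnique(listings)); // prints [ns1.tbl_a, ns1.tbl_b, ns2.tbl_c]
  }
}
```

The same idea applies to namespace listings when a catalog ignores the level restriction in `listNamespaces(namespace)`.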

@snazy (Member) left a comment

There are still issues in the code: duplicated code, overly complex code.

@ajantha-bhat (Contributor, Author)

@snazy: I have addressed the comments. Please check.

@snazy (Member) left a comment

I have addressed the comments.

There are still a lot of unaddressed comments.

build.gradle.kts (review thread resolved)
gradle/baselibs.versions.toml (review thread resolved)
gradle/libs.versions.toml (review thread resolved)
README.md Outdated

Note: Options for migrate command is exactly same as register command.

> :warning: **It is recommended to use this CLI tool when there is no in-progress commits for the tables in the source catalog.**
Member

This is too weak.

Contributor Author

Improved the sentence by mentioning the consequences, like data loss.

Member

Really? Nothing changed here.

@ajantha-bhat (Contributor, Author) commented Apr 15, 2023

Previously this statement was not there for this warning. I added it at the end:

Which can lead to missing updates, loss of data and table corruption.

Contributor Author

I will update this with the new, stronger sentence from ChatGPT:

Avoid using this CLI tool when there are in-progress commits for tables in the source catalog to prevent missing updates, data loss, and table corruption in the target catalog. In-progress commits may not be properly transferred and could compromise the integrity of your data.

Contributor Author

I think you checked only the latest commits. These were addressed before that commit

678d5ca is the commit which fixed it.

README.md (review thread resolved)
README.md Outdated
--source-catalog-properties warehouse=/tmp/warehouse,type=hadoop \
--target-catalog-type NESSIE \
--target-catalog-properties uri=https://nessie.test1.dremio.site/v1/repositories/8158e68a-5046-42c6-a7e4-c920d9ae2475,ref=main,warehouse=/tmp/warehouse,authentication.type=BEARER,authentication.token=$PAT \
--target-catalog-hadoop-conf fs.s3a.secret.key=$SECRETKEY,fs.s3a.access.key=$ACCESSKEY
Member

Don't these options support passing the keys via environment variables?
Definitely mention that passing secrets and other credentials on the command line is NOT safe
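One common pattern for this concern, sketched here as a generic stdlib-only example (the helper name and fallback behaviour are illustrative, not the tool's actual code), is to resolve credentials from environment variables rather than command-line flags, since command-line arguments can be visible to other local processes via `ps`:

```java
public class SecretResolutionSketch {

  // Prefer the environment over a command-line value; never echo the secret.
  static String resolveSecret(String cliValue, String envVar) {
    String fromEnv = System.getenv(envVar);
    if (fromEnv != null && !fromEnv.isEmpty()) {
      return fromEnv;
    }
    return cliValue; // fallback, but this exposes the secret in `ps` output
  }

  public static void main(String[] args) {
    // AWS_SECRET_ACCESS_KEY is the conventional variable for S3 credentials.
    String secret = resolveSecret(null, "AWS_SECRET_ACCESS_KEY");
    System.out.println(secret == null ? "no secret configured" : "secret loaded from environment");
  }
}
```

Whether the S3A properties shown in the README example can pick up these environment variables directly would need to be verified against the Hadoop S3A documentation.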

Contributor Author

updated.

Member

Really? Nothing changed here.

Contributor Author

Removed the mention of keys in the catalog options descriptions and in the README, as it works without it.

I don't want to remove the catalog-hadoop-conf option itself, because I am not sure which catalogs use it.
Iceberg also exposes a Hadoop conf for catalog configuration, so I had to expose it.

https://github.com/apache/iceberg/blob/b756a5da3559be9deee9cdc674c4bf36ac291231/core/src/main/java/org/apache/iceberg/CatalogUtil.java#L216

README.md (review threads resolved)
@snazy (Member) left a comment

Have to -1 this again, for reasons mentioned a lot of times in previous -1 reviews in this and other PRs.

@ajantha-bhat (Contributor, Author)

@snazy:
After pushing the latest changes,
I have gone through each unresolved comment from the beginning of this PR and replied to them.
I didn't mark them resolved, as there can be a conflict of opinion or I might have misinterpreted some comments.
So you can check them and mark them resolved if they are, or help me understand the context more.

After going through all the comments, as I see it, two things are pending:

  • checking the buildSrc with Nessie and cel-java again, keeping it updated, and adding the pieces about publishing.
  • the exit code of CatalogMigrationCLI: even though I replaced exit code 2 with exit code 1, I am not sure whether exit code 1 can be used here, as it is also reserved (https://picocli.info/apidocs/picocli/CommandLine.ExitCode.html).
    I found IExitCodeExceptionMapper, but that is about mapping exceptions to new exit codes. So I am not sure what exit code to use here. I will continue to explore this.
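On the exit-code question: picocli's documented defaults are ExitCode.OK = 0, ExitCode.SOFTWARE = 1 (exception during execution), and ExitCode.USAGE = 2 (invalid user input), so returning 1 for an execution failure is consistent with that convention. A stdlib-only sketch of one possible mapping (illustrative, not the CLI's actual code):

```java
public class ExitCodeSketch {

  // Mirrors picocli's CommandLine.ExitCode constants.
  static final int OK = 0;       // success
  static final int SOFTWARE = 1; // exception during execution
  static final int USAGE = 2;    // invalid user input

  // A possible mapping: usage problems -> 2, execution failures -> 1.
  static int exitCodeFor(Throwable failure) {
    if (failure == null) {
      return OK;
    }
    if (failure instanceof IllegalArgumentException) {
      return USAGE;
    }
    return SOFTWARE;
  }

  public static void main(String[] args) {
    System.out.println(exitCodeFor(null));                           // 0
    System.out.println(exitCodeFor(new IllegalArgumentException())); // 2
    System.out.println(exitCodeFor(new RuntimeException("failed"))); // 1
  }
}
```

In a real picocli command, such a mapping could be plugged in via IExitCodeExceptionMapper instead of being hard-coded.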

@snazy (Member) commented Apr 14, 2023

Not sure I understand the last comment. Is this PR now reviewable or are there more changes coming?

@ajantha-bhat (Contributor, Author)

After going through all the comments, as I see it, two things are pending:

  • checking the buildSrc with Nessie and cel-java again, keeping it updated, and adding the pieces about publishing.
  • the exit code of CatalogMigrationCLI: even though I replaced exit code 2 with exit code 1, I am not sure whether exit code 1 can be used here, as it is also reserved (https://picocli.info/apidocs/picocli/CommandLine.ExitCode.html).
    I found IExitCodeExceptionMapper, but that is about mapping exceptions to new exit codes. So I am not sure what exit code to use here. I will continue to explore this.

I am checking these two things and will ping you once both are addressed, so the PR can be reviewed in one pass.

@ajantha-bhat (Contributor, Author)

@snazy:

PR is ready for review.

I have handled build publishing too now (locally tested with gradlew publishToMavenLocal).

I have also replied to every unresolved comment. So if you feel anything is still unresolved, please comment back on the same thread for further discussion.

@snazy (Member) commented Apr 14, 2023

Suddenly, a lot of new stuff appeared in this PR via a force-push.

@ajantha-bhat (Contributor, Author)

Suddenly, a lot of new stuff appeared in this PR via a force-push.

It was to address one of the open comments regarding "build publishing". I have reverted it now and will handle it in a follow-up PR to reduce the review effort on this PR.

errorprone-annotations = { module = "com.google.errorprone:error_prone_annotations", version.ref = "errorprone" }
errorprone-core = { module = "com.google.errorprone:error_prone_core", version.ref = "errorprone" }
errorprone-slf4j = { module = "jp.skypencil.errorprone.slf4j:errorprone-slf4j", version.ref = "errorproneSlf4j" }
ecs-bundle = { module = "com.emc.ecs:object-client-bundle", version.ref = "ecs" }
Member

Commercially licensed, NOT ASF2 licensed.
Remove it from this PR.
I do not wanna go down the rabbit hole and discuss this issue in this PR.

Contributor Author

Apache Iceberg (Apache 2.0 licensed) also depends on this at compile time for the iceberg-ecs module.

https://github.com/apache/iceberg/blob/b756a5da3559be9deee9cdc674c4bf36ac291231/build.gradle#L820

So I thought it was compatible.

I will remove it. If the user gets a ClassNotFoundException while migrating to an ECS catalog, they might have to add this jar to the classpath.

junit-jupiter-params = { module = "org.junit.jupiter:junit-jupiter-params" }
logback-classic = { module = "ch.qos.logback:logback-classic", version.ref = "logback" }
logcaptor = { module = "io.github.hakky54:logcaptor", version.ref = "logcaptor" }
mysql-driver = { module = "mysql:mysql-connector-java", version.ref = "mysqlDriver" }
Member

GPL2 w/ exceptions licensed, NOT ASF2 licensed.
Remove it from this PR.
I do not wanna go down the rabbit hole and discuss this issue in this PR.

Contributor Author

Removed.

Users can add this to the classpath during a migration to/from a JDBC catalog (backed by MySQL).

*/
public Set<TableIdentifier> getMatchingTableIdentifiers(String identifierRegex) {
Catalog sourceCatalog = sourceCatalog();
if (!(sourceCatalog instanceof SupportsNamespaces)) {
Member

All catalogs can list tables, no?
org.apache.iceberg.catalog.Catalog#listTables

return sourceCatalog.listTables(namespace).stream()
.filter(matchedIdentifiersPredicate);
} catch (Exception exception) {
if (namespace.isEmpty()) {
Member

Why IllegalArgumentException?

README.md Outdated

Note: Options for migrate command is exactly same as register command.

> :warning: **It is recommended to use this CLI tool when there is no in-progress commits for the tables in the source catalog.**
Member

Really? Nothing changed here.

@snazy (Member) left a comment

Some requested changes not tackled + licensing issues.

@ajantha-bhat (Contributor, Author)

Some requested changes not tackled + licensing issues.

Fixed these comments.
Also replied about why IllegalArgumentException is caught in CatalogMigrator.

}

private static void writeToFile(Path filePath, Collection<TableIdentifier> identifiers) {
if (identifiers.isEmpty()) {
Member

When we discussed this tool in detail, we concluded that the purpose of the "failed" and "successful" table-identifier lists is to allow retries with the failed tables.

This condition, however, likely makes that scenario impossible and leads to ambiguous results - the same table identifier appearing in the files for both successful and failed migrations.

Contributor Author

True.
Removed now.

targetCatalog().name()));
}

if (!(sourceCatalog() instanceof SupportsNamespaces)) {
Member

I am surprised that this condition now appears here.
Does this mean that there are no catalogs that do not implement SupportsNamespaces and all the work and back-and-forth was just wasted effort?

Contributor Author

All the Iceberg catalogs supported by this tool (maybe excluding custom catalogs) implement SupportsNamespaces. There can be a custom catalog without SupportsNamespaces, but I don't want to drag this discussion out based on an imaginary catalog.

If such a catalog exists in the future, the user can raise a requirement and we can support it.

// some catalogs don't support default namespace.
// Hence, just log the warning and ignore the exception.
LOG.warn(
"Failed to identify tables from default namespace: {}",
Member

default is wrong here, same for the comment

Contributor Author

changed to empty namespace

* @param identifiers collection of table identifiers to register or migrate
* @return {@code this} for use in a chained invocation
*/
public CatalogMigrator registerTables(Collection<TableIdentifier> identifiers) {
Member

Only used for tests, should go away here.

Contributor Author

removed

* If the migration is successful but deletion of some tables from the source catalog fails, the summary will mention that these table names were written into the `failed_to_delete.txt` file, and the logs will capture the failure reason.
Do not operate these tables from the source catalog; the user will have to delete them manually.

### B.4) User executes the migration of 1000 tables but manually aborts it by killing the process.
Member

When the user kills the process or another hard failure occurs, the "result files" will not be populated and the user is "left in the dark". I do NOT want that to be fixed in this PR - but it HAS TO BE fixed.

Contributor Author

Created #8 for tracking

@ajantha-bhat (Contributor, Author)

Addressed today's comments.

@snazy snazy merged commit 1ea1ddc into projectnessie:main Apr 18, 2023