Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REST Catalog does not validate "to" identifier on rename table #11154

Open
haizhou-zhao opened this issue Sep 17, 2024 · 2 comments
Open

REST Catalog does not validate "to" identifier on rename table #11154

haizhou-zhao opened this issue Sep 17, 2024 · 2 comments
Labels
question Further information is requested

Comments

@haizhou-zhao
Copy link
Contributor

Query engine

Spark

Question

Background

Spark will pass catalog name to renameTable operations as part of its to identifier, and if that catalog name is not handled (i.e. stripped), then it will be treated as part of the namespace name. This will cause spark to rename the table into undesirable namespaces (or even just encounter namespace not exist issue and fail).

Traditional catalog like HiveCatalog has special pre-validation in renameTable operation for this purpose (ref: #1156), while RESTCatalog and JdbcCatalog (ref REST server implementation is adapter on JdbcCatalog) does not have it. This is causing spark command like ALTER TABLE ${tbl} RENAME TO ${tbl_rename} to fail on RESTCatalog/JdbcCatalog such as in this integration test: https://github.com/apache/iceberg/blob/316f0a1/spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/sql/TestAlterTable.java#L296

Part of #11079

Questions

  1. Should RESTCatalog (client) validate to identifier in renameTable operation just like what HiveCatalog is doing?
  2. If RESTCatalog, like hive, always strip the first level of namespace for to identifier, then what if the intention is to rename to a legitimate multi-level namespace?
@haizhou-zhao haizhou-zhao added the question Further information is requested label Sep 17, 2024
@haizhou-zhao haizhou-zhao changed the title Reference REST Catalog does not validate "to" identifier on rename table REST Catalog does not validate "to" identifier on rename table Sep 17, 2024
@nastra
Copy link
Contributor

nastra commented Sep 25, 2024

I analyzed this and RenameTableExec indeed adds the namespace to the to identifier: https://github.com/apache/spark/blob/branch-3.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/RenameTableExec.scala#L46-L51.

Stripping the catalog name was added to Hive with #1156, but what I think we need to investigate here is why this issue only happens in testing and not when running through the Spark quickstart example, which also uses the REST catalog.

@haizhou-zhao
Copy link
Contributor Author

haizhou-zhao commented Oct 4, 2024

Most likely it's related to whether you provide catalog name to rename statement or not.

Here's what I did:

  1. Spin up the docker compose, so that an REST catalog is up on localhost:8181
  2. Start spark-shell with the following config:
    spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.iceberg.catalog-impl=org.apache.iceberg.rest.RESTCatalog
    spark.sql.catalog.iceberg.uri=http://localhost:8181/
    spark.sql.catalog.iceberg.io-impl=org.apache.iceberg.io.ResolvingFileIO
    spark.sql.catalog.iceberg.warehouse=file://tmp/warehouse/
    spark.sql.defaultCatalog=iceberg
    
  3. Run the following two sets of commands, even though they are expected to yield the same result, case 1 will fail, yet case 2 will succeed. The difference is whether the catalog name is explicitly mentioned or not.
    Case 1:
    spark.sql("CREATE NAMESPACE iceberg.ns1")
    spark.sql("CREATE NAMESPACE iceberg.ns2")
    spark.sql("CREATE TABLE iceberg.ns1.tbl1 (id INT, data STRING)")
    spark.sql("ALTER TABLE iceberg.ns1.tbl1 RENAME TO iceberg.ns2.tbl2")
    
    Case 2:
    spark.sql("CREATE NAMESPACE ns1")
    spark.sql("CREATE NAMESPACE ns2")
    spark.sql("CREATE TABLE ns1.tbl1 (id INT, data STRING)")
    spark.sql("ALTER TABLE ns1.tbl1 RENAME TO ns2.tbl2")
    

Explanation:
If catalog name is explicitly mentioned, then it will be submitted as part of the to identifier to construct RenameTableExec. As RESTCatalog doesn't have capability to remove catalog name from the identifier, the rename operation will fail.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants