Description
Is there an existing issue for this?
- I have searched the existing issues
Current Behavior
We currently have an internal workflow that converts managed Spark tables to external tables. It relies on constructing CatalogTable objects directly from PySpark using internal JVM/Scala APIs.
This approach has proven brittle because:
- CatalogTable constructor signatures change frequently between Spark versions (e.g., Spark 3.5.x added extra fields for Delta runtime properties).
- Accessing Scala companion objects from Python via Py4J is complex (CatalogTable$.MODULE$, apply()) and breaks depending on the Spark/Databricks runtime; see the sketch after this list.
- There is no stable public API to toggle a table from managed to external.
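For context, here is a minimal sketch (illustrative, not our exact code) of the Py4J pattern involved; the companion-object lookup and the version-specific constructor arity are the parts that keep breaking:

```python
# Illustrative sketch of the fragile pattern described above; exact details
# vary by Spark version, which is the root of the problem.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
jvm = spark._jvm  # private Py4J gateway into the driver JVM

# A Scala companion object compiles to a `CatalogTable$` class holding a
# static MODULE$ field, so it cannot be reached by plain attribute access
# and needs a getattr() escape hatch -- one of the hacks called out above.
catalog_pkg = jvm.org.apache.spark.sql.catalyst.catalog
catalog_table_companion = getattr(catalog_pkg, "CatalogTable$").MODULE$

# Any direct constructor call must match the running Spark version's exact
# parameter list; Spark 3.5.x added fields, which is what produces the
# Py4JError captured in the log output below.
```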
Proposed Next Steps:
- Investigate whether there is a public API or safe SQL-based flow for converting a managed table to an external table. For example: use ALTER TABLE ... SET LOCATION and recreate the table metadata if required, or use a supported SparkSession.catalog interface for externalization (a hedged sketch follows this list).
- Explore implementing a JVM-side helper (Scala) to encapsulate this logic and avoid Python → Scala constructor hacks.
- Review and track other places where internal JVM/Scala references are used (e.g., SessionCatalog, CatalogStorageFormat.copy, CatalogTableType.EXTERNAL), and create follow-up issues for stabilizing those.
Expected Behavior
- A robust, maintainable solution to convert managed → external tables that:
  - Does not depend on unstable internal constructors.
  - Works across Spark versions and Databricks runtimes.
  - Can be extended to other internal APIs where we rely on Scala internals today.
Steps To Reproduce
Run test_table_migration_convert_manged_to_external on a DBR 16+ cluster and check the convert_managed_to_external task.
Cloud
Azure
Operating System
macOS
Version
latest via Databricks CLI
Relevant log output
py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.sql.catalyst.catalog.CatalogTable. Trace:
py4j.Py4JException: Constructor org.apache.spark.sql.catalyst.catalog.CatalogTable([class org.apache.spark.sql.catalyst.TableIdentifier, class org.apache.spark.sql.catalyst.catalog.CatalogTableType, class org.apache.spark.sql.catalyst.catalog.CatalogStorageFormat, class org.apache.spark.sql.types.StructType, class scala.Some, class scala.collection.immutable.Nil$, class scala.None$, class java.lang.String, class java.lang.Long, class java.lang.Integer, class java.lang.String, class scala.collection.immutable.HashMap$HashMap1, class scala.None$, class scala.None$, class scala.None$, class scala.collection.mutable.ArrayBuffer, class java.lang.Boolean, class java.lang.Boolean, class scala.collection.immutable.Map$Map1, class scala.None$, class scala.collection.immutable.Nil$]) does not exist
at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:203)
at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:220)
at py4j.Gateway.invoke(Gateway.java:255)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:197)
at py4j.ClientServerConnection.run(ClientServerConnection.java:117)
at java.base/java.lang.Thread.run(Thread.java:840)