
Find long-term solution to convert managed tables to external without using unstable internal JVM/Scala APIs #4331

Description

@pritishpai

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

We currently have an internal workflow to convert managed Spark tables to external tables. It relies on constructing CatalogTable objects directly from PySpark using internal JVM/Scala APIs (sketched after the list below).

This approach has proven brittle because:

  • CatalogTable constructor signatures frequently change between Spark versions (e.g., Spark 3.5.x added extra fields for Delta runtime properties).
  • Accessing Scala companion objects from Python via Py4J is complex (CatalogTable$.MODULE$, apply()) and breaks depending on the Spark/Databricks runtime.
  • There is no stable public API to directly toggle a table from managed to external.
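
For context, here is a minimal sketch (not the actual workflow code) of the pattern these bullets describe: reaching Scala companion objects through PySpark's private Py4J gateway.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jvm = spark._jvm  # private PySpark attribute, not a public API
catalog_pkg = jvm.org.apache.spark.sql.catalyst.catalog

# A Scala companion object surfaces in Py4J as a `ClassName$` class with a
# static MODULE$ field; `$` is not valid in Python identifiers, hence getattr.
companion_cls = getattr(catalog_pkg, "CatalogTable$")
catalog_table_companion = getattr(companion_cls, "MODULE$")

# From here, any apply()/constructor call has to match the exact field list
# of the running Spark version; Spark 3.5.x added fields, so positional calls
# fail with the Py4JError shown in the log output below.
```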

Proposed Next Steps:
Investigate whether there is a public API or a safe SQL-based flow for converting a managed table to an external table. For example, use ALTER TABLE ... SET LOCATION and recreate the table metadata if required, or use a supported SparkSession.catalog interface for externalization (a sketch follows).
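
Below is a hedged sketch of what that SQL-based flow could look like from PySpark. The table name and location are placeholders, and whether SET LOCATION alone flips a managed table to EXTERNAL depends on the catalog and runtime, so the resulting type is re-read rather than assumed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

table = "main.default.my_table"  # placeholder
location = "abfss://container@account.dfs.core.windows.net/tables/my_table"  # placeholder

# Point the table at an explicit location using only public SQL.
spark.sql(f"ALTER TABLE {table} SET LOCATION '{location}'")

# Verify the result via DESCRIBE EXTENDED instead of internal CatalogTable
# objects; the 'Type' row reads MANAGED or EXTERNAL.
described = spark.sql(f"DESCRIBE TABLE EXTENDED {table}")
table_type = described.where(described.col_name == "Type").select("data_type").first()
print(table_type)
```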

Explore implementing a JVM-side helper (Scala) to encapsulate this logic and avoid Python → Scala constructor hacks (a sketch of the boundary follows). Review and track other places where internal JVM/Scala references are used (e.g., SessionCatalog, CatalogStorageFormat.copy, CatalogTableType.EXTERNAL), and create follow-up issues for stabilizing those.
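
If the helper route is taken, the Python side could shrink to a single stable entry point. Everything below is hypothetical (package, class, and method names); the point is one crossing over Py4J instead of mirroring constructor signatures in Python:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical compiled helper shipped alongside the workflow. A top-level
# Scala object compiles to a class with static forwarder methods, so Py4J can
# call it directly, without the MODULE$ gymnastics shown earlier. All
# CatalogTable/SessionCatalog usage would live in Scala, compiled against the
# Spark version it runs on.
helper = spark._jvm.com.example.ucx.TableExternalizer  # hypothetical class
helper.convertToExternal(spark._jsparkSession, "my_schema", "my_table")
```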

Expected Behavior

A robust, maintainable solution to convert managed → external tables that:
  • Does not depend on unstable internal constructors.
  • Works across Spark versions and Databricks runtimes.
  • Can be extended to other internal APIs where we rely on Scala internals today.

Steps To Reproduce

Run test_table_migration_convert_manged_to_external on a DBR 16+ cluster and check the convert_managed_to_external task.

Cloud

Azure

Operating System

macOS

Version

latest via Databricks CLI

Relevant log output

py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.sql.catalyst.catalog.CatalogTable. Trace:
py4j.Py4JException: Constructor org.apache.spark.sql.catalyst.catalog.CatalogTable([class org.apache.spark.sql.catalyst.TableIdentifier, class org.apache.spark.sql.catalyst.catalog.CatalogTableType, class org.apache.spark.sql.catalyst.catalog.CatalogStorageFormat, class org.apache.spark.sql.types.StructType, class scala.Some, class scala.collection.immutable.Nil$, class scala.None$, class java.lang.String, class java.lang.Long, class java.lang.Integer, class java.lang.String, class scala.collection.immutable.HashMap$HashMap1, class scala.None$, class scala.None$, class scala.None$, class scala.collection.mutable.ArrayBuffer, class java.lang.Boolean, class java.lang.Boolean, class scala.collection.immutable.Map$Map1, class scala.None$, class scala.collection.immutable.Nil$]) does not exist
	at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:203)
	at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:220)
	at py4j.Gateway.invoke(Gateway.java:255)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:197)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:117)
	at java.base/java.lang.Thread.run(Thread.java:840)
