Conversation


@pan3793 pan3793 commented Oct 23, 2025

What changes were proposed in this pull request?

This PR modifies the classpath for bin/beeline - excluding spark-sql-core_*.jar, spark-connect_*.jar, etc., and adding jars/connect-repl/*.jar, making it the same as bin/spark-connect-shell. The modified classpath looks like:

- jars/*.jar - except for spark-sql-core_*.jar, spark-connect_*.jar, etc.
- jars/connect-repl/*.jar - including spark-connect-client-jdbc_*.jar

Note: BeeLine itself only requires Hive jars and a few third-party utility jars to run, so excluding some spark-*.jar files won't break BeeLine's existing capability to connect to the Thrift Server.

To ensure no change in classic Spark behavior, for the Spark classic (default) distribution, the above changes only take effect when SPARK_CONNECT_BEELINE=1 is set explicitly. For convenience, this is enabled by default for the Spark connect distribution.
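
For illustration, a minimal sketch of the jar filtering described above; the class and method names and the exact prefix list are assumptions for illustration, not the actual patch (the real logic lives in Spark's launcher):

import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the BeeLine classpath rewrite; illustrative only.
class BeeLineClasspathSketch {
  // Jar prefixes excluded when SPARK_CONNECT_BEELINE=1 (illustrative, not exhaustive).
  private static final List<String> EXCLUDED_PREFIXES =
      List.of("spark-sql-core_", "spark-connect_");

  static List<String> connectBeeLineClasspath(List<String> sparkJars,
                                              List<String> connectReplJars) {
    // Drop the conflicting classic jars from jars/*.jar ...
    List<String> result = sparkJars.stream()
        .filter(jar -> EXCLUDED_PREFIXES.stream().noneMatch(jar::startsWith))
        .collect(Collectors.toList());
    // ... then append jars/connect-repl/*.jar, which includes
    // spark-connect-client-jdbc_*.jar.
    result.addAll(connectReplJars);
    return result;
  }
}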

Why are the changes needed?

It's a new feature. With it, users can use BeeLine as a SQL CLI to connect to the Spark Connect server.

Does this PR introduce any user-facing change?

No. This feature must be enabled by setting SPARK_CONNECT_BEELINE=1 explicitly for the classic (default) Spark distribution.

How was this patch tested?

Launch a Connect Server first; in my case, the Connect Server (v4.1.0-preview2) runs at sc://localhost:15002. To ensure the changes won't break the Thrift Server use case, also launch a Thrift Server at thrift://localhost:10000.

Testing for dev mode

Building

$ build/sbt -Phive,hive-thriftserver clean package

Without setting SPARK_CONNECT_BEELINE=1, it fails as expected with 'No known driver to handle "jdbc:sc://localhost:15002"':

$ SPARK_PREPEND_CLASSES=true bin/beeline -u jdbc:sc://localhost:15002
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
WARNING: Using incubator modules: jdk.incubator.vector
scan complete in 0ms
scan complete in 4ms
No known driver to handle "jdbc:sc://localhost:15002"
Beeline version 2.3.10 by Apache Hive
beeline>

With SPARK_CONNECT_BEELINE=1 set, it works as expected:

$ SPARK_PREPEND_CLASSES=true SPARK_CONNECT_BEELINE=1 bin/beeline -u jdbc:sc://localhost:15002
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
WARNING: Using incubator modules: jdk.incubator.vector
Connecting to jdbc:sc://localhost:15002
Connected to: Apache Spark Connect Server (version 4.1.0-preview2)
Driver: Apache Spark Connect JDBC Driver (version 4.1.0-SNAPSHOT)
Error: Requested transaction isolation level REPEATABLE_READ is not supported (state=,code=0)
Beeline version 2.3.10 by Apache Hive
0: jdbc:sc://localhost:15002> select 'Hello, Spark Connect!', version() as server_version;
+------------------------+-------------------------------------------------+
| Hello, Spark Connect!  |                 server_version                  |
+------------------------+-------------------------------------------------+
| Hello, Spark Connect!  | 4.1.0 c5ff48cc2b2c5a1526789ae414ff4c63b053d3ec  |
+------------------------+-------------------------------------------------+
1 row selected (0.476 seconds)
0: jdbc:sc://localhost:15002>

Also, test with the Thrift Server to ensure no impact on existing functionality.

It works as expected both with and without SPARK_CONNECT_BEELINE=1:

$ SPARK_PREPEND_CLASSES=true [SPARK_CONNECT_BEELINE=1] bin/beeline -u jdbc:hive2://localhost:10000
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
WARNING: Using incubator modules: jdk.incubator.vector
Connecting to jdbc:hive2://localhost:10000
Connected to: Spark SQL (version 4.1.0-preview2)
Driver: Hive JDBC (version 2.3.10)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.3.10 by Apache Hive
0: jdbc:hive2://localhost:10000> select 'Hello, Spark Connect!', version() as server_version;
+------------------------+-------------------------------------------------+
| Hello, Spark Connect!  |                 server_version                  |
+------------------------+-------------------------------------------------+
| Hello, Spark Connect!  | 4.1.0 c5ff48cc2b2c5a1526789ae414ff4c63b053d3ec  |
+------------------------+-------------------------------------------------+
1 row selected (0.973 seconds)
0: jdbc:hive2://localhost:10000>

Testing for Spark distributions

$ dev/make-distribution.sh --tgz --connect --name SPARK-54002 -Pyarn -Pkubernetes -Phadoop-3 -Phive -Phive-thriftserver
Spark classic distribution
$ tar -xzf spark-4.1.0-SNAPSHOT-bin-SPARK-54002.tgz
$ cd spark-4.1.0-SNAPSHOT-bin-SPARK-54002
$ bin/beeline -u jdbc:sc://localhost:15002 -e "select 'Hello, Spark Connect!', version() as server_version;"
... (negative result, fails with 'No known driver to handle "jdbc:sc://localhost:15002"')
$ SPARK_CONNECT_BEELINE=1 bin/beeline -u jdbc:sc://localhost:15002 -e "select 'Hello, Spark Connect!', version() as server_version;"
... (positive result)
$ bin/beeline -u jdbc:hive2://localhost:10000 -e "select 'Hello, Spark Connect!', version() as server_version;"
... (positive result)
$ SPARK_CONNECT_BEELINE=1 bin/beeline -u jdbc:hive2://localhost:10000 -e "select 'Hello, Spark Connect!', version() as server_version;"
... (positive result)
Spark connect distribution
$ tar -xzf spark-4.1.0-SNAPSHOT-bin-SPARK-54002-connect.tgz
$ cd spark-4.1.0-SNAPSHOT-bin-SPARK-54002-connect
$ bin/beeline -u jdbc:sc://localhost:15002 -e "select 'Hello, Spark Connect!', version() as server_version;"
... (positive result)
$ bin/beeline -u jdbc:hive2://localhost:10000 -e "select 'Hello, Spark Connect!', version() as server_version;"
... (positive result)

Was this patch authored or co-authored using generative AI tooling?

No.

@pan3793 pan3793 marked this pull request as draft October 23, 2025 12:36
@pan3793 pan3793 marked this pull request as ready for review October 23, 2025 13:06

pan3793 commented Oct 23, 2025

This feature requires SPARK-53934 (#52705) to work with the Connect Server; reviewers who want to verify this locally should apply that patch manually or wait for that patch to land.

SPARK-53934 has landed on master.


pan3793 commented Oct 23, 2025

cc @LuciferYang @HyukjinKwon

dongjoon-hyun pushed a commit that referenced this pull request Oct 23, 2025
### What changes were proposed in this pull request?

This is the initial implementation of the Connect JDBC driver. In detail, this PR implements the essential JDBC interfaces listed below.

- `java.sql.Driver`
- `java.sql.Connection`
- `java.sql.Statement`
- `java.sql.ResultSet`
- `java.sql.ResultSetMetaData`
- `java.sql.DatabaseMetaData`

As a first step, this PR only supports `NullType`, `BooleanType`, `ByteType`, `ShortType`, `IntegerType`, `LongType`, `FloatType`, `DoubleType`, and `StringType`.
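
For readers unfamiliar with the driver, a minimal sketch of exercising these interfaces through the plain java.sql API might look like the following; it assumes the Connect JDBC driver jar is on the classpath and self-registers with DriverManager, and uses the jdbc:sc:// URL scheme shown elsewhere in this PR:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;

// Minimal sketch of a JDBC client against a Spark Connect server;
// assumes the driver jar is on the classpath and registers itself.
public class ConnectJdbcSketch {
  public static void main(String[] args) throws SQLException {
    try (Connection conn = DriverManager.getConnection("jdbc:sc://localhost:15002");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT 'Hello, Spark Connect!', version()")) {
      ResultSetMetaData md = rs.getMetaData();
      while (rs.next()) {
        for (int i = 1; i <= md.getColumnCount(); i++) {
          System.out.printf("%s = %s%n", md.getColumnLabel(i), rs.getString(i));
        }
      }
    }
  }
}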

### Why are the changes needed?

Basically, this implements the feature proposed in [SPIP: JDBC Driver for Spark Connect](https://issues.apache.org/jira/browse/SPARK-53484).

### Does this PR introduce _any_ user-facing change?

It's a new feature.

### How was this patch tested?

New UTs are added.

And I have also cross-verified the BeeLine cases with SPARK-54002 (#52706).

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #52705 from pan3793/SPARK-53934.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun dongjoon-hyun left a comment


IIUC, I believe this PR is a little beyond the scope of SPARK-53484: SPIP JDBC Driver for Spark Connect, because we didn't sign off on moving away from spark-sql-core_*.jar for this additional module.

This PR modifies the classpath for bin/beeline - excluding spark-sql-core_*.jar and adding jars/connect-repl/*.jar, making it the same as bin/spark-connect-shell.

Specifically, I have a worry about this part which is blindly pushing the users away from the classic code.

- if (isRemote) {
+ if (isRemote || isBeeLine) {

Please make beeline work with the existing classpath by default. The new code path should be applied additionally via configuration or environment variables, @pan3793.


pan3793 commented Oct 24, 2025

Please make beeline work with the existing classpath by default. The new code path should be applied additionally via configuration or environment variables.

@dongjoon-hyun In general, I agree with your concerns and the idea of having a switch for the classpath assembly, but I think we can have a different default behavior, for the following reasons:

  1. Technically, BeeLine does NOT use Spark classes.
    Spark integrates the vanilla Hive BeeLine without modification; the dependency list can be found on Maven Central. Excluding some classic Spark jars should NOT be risky.

  2. To avoid surprising users, we'd better make BeeLine work with the Connect Server out of the box, which means we should tune the classpath automatically.

  3. If we want to achieve both point 2 and making beeline work with the existing classpath by default, we must have a mechanism to distinguish which service BeeLine is going to connect to, which raises two issues:

    1. We can parse the args in SparkClassCommandBuilder to detect the connect service if the user provides the JDBC URL in the command directly, e.g., beeline -u 'jdbc:sc://xxxx', but this means we need to process BeeLine args in the Spark Launcher, which introduces additional complexity and is not desirable IMO (see the sketch after this list).
    2. BeeLine also allows users to use !connect <jdbc-url> to connect to a DBMS in interactive mode (after starting the CLI). In this case, we don't have a chance to dynamically change the classpath.
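
To illustrate why the detection in 3.1 only covers the direct-URL case, here is a hypothetical sketch of such a check in the launcher; the helper name and argument layout are assumptions, not proposed code:

// Hypothetical helper: scan BeeLine's command-line args for a Connect JDBC URL.
// It cannot see URLs entered later via the interactive !connect command (issue 3.2).
static boolean looksLikeConnectUrl(String[] beeLineArgs) {
  for (int i = 0; i < beeLineArgs.length - 1; i++) {
    if ("-u".equals(beeLineArgs[i]) && beeLineArgs[i + 1].startsWith("jdbc:sc://")) {
      return true;
    }
  }
  return false;
}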

Given the above reasons, I think we can change the classpath as proposed by this PR by default, and have an internal switch (i.e., env var SPARK_BEELINE_CLASSIC, kept at least until 5.x) as a backdoor to allow users to switch back to the original classpath if something goes wrong.

Or, if we are very conservative, we can provide a switch (e.g., env var SPARK_BEELINE_CONNECT) that the user must set explicitly before using BeeLine to connect to the Connect Server. TBH, I think this hurts the user experience:

$ SPARK_BEELINE_CONNECT=1 bin/beeline -u jdbc:sc://localhost:15002

or

$ SPARK_BEELINE_CONNECT=1 bin/beeline
beeline> !connect jdbc:sc://localhost:15002

also cc @LuciferYang

@dongjoon-hyun

Please note that Apache Spark's default distribution is the non-Spark-Connect version. The majority of users are not using Connect at all. So, there is nothing here that hurts the existing classic user experience.

Or, if we are very conservative, we can provide a switch (e.g., env var SPARK_BEELINE_CONNECT) that the user must set explicitly before using BeeLine to connect to the Connect Server. TBH, I think this hurts the user experience.

As I mentioned, the only point on which we disagree is the following.

Given the above reasons, I think we can change the classpath as proposed by this PR by default,

If this PR contains any removal of classic Spark behavior, I need to cast a veto here, unfortunately.

@pan3793 pan3793 marked this pull request as draft October 27, 2025 02:34
if (isRemote && "1".equals(getenv("SPARK_SCALA_SHELL")) && project.equals("sql/core")) {
  continue;
}
if (isBeeLine && project.equals("sql/core")) {

Is it possible to detect that BeeLine is being used with the Connect JDBC driver and apply special handling solely for that specific scenario?


@LuciferYang short answer - it's possible only in some cases, details were explained at #52706 (comment)

@github-actions github-actions bot added the BUILD label Oct 27, 2025
@pan3793 pan3793 marked this pull request as ready for review October 27, 2025 12:08

pan3793 commented Oct 27, 2025

@dongjoon-hyun @LuciferYang I have updated the PR to keep the existing logic by default. Now, for the Spark classic (default) distribution, the proposed classpath changes to bin/beeline only take effect when SPARK_CONNECT_BEELINE=1 is set explicitly. For convenience, this is enabled by default for the Spark connect distribution.
