Conversation


@pan3793 pan3793 commented Oct 23, 2025

What changes were proposed in this pull request?

This PR modifies the classpath for bin/beeline - excluding spark-sql-core_*.jar, spark-connect_*.jar, etc., and adding jars/connect-repl/*.jar, making it the same as bin/spark-connect-shell. The modified classpath looks like:

- jars/*.jar - except for spark-sql-core_*.jar, spark-connect_*.jar, etc.
- jars/connect-repl/*.jar - including spark-connect-client-jdbc_*.jar

Note: BeeLine itself only requires Hive jars and a few third-party utility jars to run, so excluding some spark-*.jar files won't break BeeLine's existing capability to connect to the Thrift Server.

To ensure no change in classic Spark behavior, for the Spark classic (default) distribution, the above changes only take effect when SPARK_CONNECT_BEELINE=1 is set explicitly. For convenience, this is enabled by default for the Spark connect distribution.
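
For illustration, a minimal sketch of the jar filtering described above; the class and method names and the exact prefix list are assumptions for illustration, not the actual patch (the real logic lives in Spark's launcher):

import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the BeeLine classpath rewrite; illustrative only.
class BeeLineClasspathSketch {
  // Jar prefixes excluded when SPARK_CONNECT_BEELINE=1 (illustrative, not exhaustive).
  private static final List<String> EXCLUDED_PREFIXES =
      List.of("spark-sql-core_", "spark-connect_");

  static List<String> connectBeeLineClasspath(List<String> sparkJars,
                                              List<String> connectReplJars) {
    // Drop the conflicting classic jars from jars/*.jar ...
    List<String> result = sparkJars.stream()
        .filter(jar -> EXCLUDED_PREFIXES.stream().noneMatch(jar::startsWith))
        .collect(Collectors.toList());
    // ... then append jars/connect-repl/*.jar, which includes
    // spark-connect-client-jdbc_*.jar.
    result.addAll(connectReplJars);
    return result;
  }
}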

Why are the changes needed?

It's a new feature. With it, users can use BeeLine as a SQL CLI to connect to the Spark Connect server.

Does this PR introduce any user-facing change?

No. This feature must be enabled by setting SPARK_CONNECT_BEELINE=1 explicitly for the classic (default) Spark distribution.

How was this patch tested?

Launch a Connect Server first; in my case, the Connect Server (v4.1.0-preview2) runs at sc://localhost:15002. To ensure the changes won't break the Thrift Server use case, also launch a Thrift Server at thrift://localhost:10000.

Testing for dev mode

Building

$ build/sbt -Phive,hive-thriftserver clean package

Without setting SPARK_CONNECT_BEELINE=1, it fails as expected with 'No known driver to handle "jdbc:sc://localhost:15002"':

$ SPARK_PREPEND_CLASSES=true bin/beeline -u jdbc:sc://localhost:15002
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
WARNING: Using incubator modules: jdk.incubator.vector
scan complete in 0ms
scan complete in 4ms
No known driver to handle "jdbc:sc://localhost:15002"
Beeline version 2.3.10 by Apache Hive
beeline>

With SPARK_CONNECT_BEELINE=1 set, it works as expected:

$ SPARK_PREPEND_CLASSES=true SPARK_CONNECT_BEELINE=1 bin/beeline -u jdbc:sc://localhost:15002
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
WARNING: Using incubator modules: jdk.incubator.vector
Connecting to jdbc:sc://localhost:15002
Connected to: Apache Spark Connect Server (version 4.1.0-preview2)
Driver: Apache Spark Connect JDBC Driver (version 4.1.0-SNAPSHOT)
Error: Requested transaction isolation level REPEATABLE_READ is not supported (state=,code=0)
Beeline version 2.3.10 by Apache Hive
0: jdbc:sc://localhost:15002> select 'Hello, Spark Connect!', version() as server_version;
+------------------------+-------------------------------------------------+
| Hello, Spark Connect!  |                 server_version                  |
+------------------------+-------------------------------------------------+
| Hello, Spark Connect!  | 4.1.0 c5ff48cc2b2c5a1526789ae414ff4c63b053d3ec  |
+------------------------+-------------------------------------------------+
1 row selected (0.476 seconds)
0: jdbc:sc://localhost:15002>

Also, test with the Thrift Server to ensure no impact on existing functionality.

It works as expected both with and without SPARK_CONNECT_BEELINE=1:

$ SPARK_PREPEND_CLASSES=true [SPARK_CONNECT_BEELINE=1] bin/beeline -u jdbc:hive2://localhost:10000
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
WARNING: Using incubator modules: jdk.incubator.vector
Connecting to jdbc:hive2://localhost:10000
Connected to: Spark SQL (version 4.1.0-preview2)
Driver: Hive JDBC (version 2.3.10)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.3.10 by Apache Hive
0: jdbc:hive2://localhost:10000> select 'Hello, Spark Connect!', version() as server_version;
+------------------------+-------------------------------------------------+
| Hello, Spark Connect!  |                 server_version                  |
+------------------------+-------------------------------------------------+
| Hello, Spark Connect!  | 4.1.0 c5ff48cc2b2c5a1526789ae414ff4c63b053d3ec  |
+------------------------+-------------------------------------------------+
1 row selected (0.973 seconds)
0: jdbc:hive2://localhost:10000>

Testing for Spark distributions

$ dev/make-distribution.sh --tgz --connect --name SPARK-54002 -Pyarn -Pkubernetes -Phadoop-3 -Phive -Phive-thriftserver
Spark classic distribution
$ tar -xzf spark-4.1.0-SNAPSHOT-bin-SPARK-54002.tgz
$ cd spark-4.1.0-SNAPSHOT-bin-SPARK-54002
$ bin/beeline -u jdbc:sc://localhost:15002 -e "select 'Hello, Spark Connect!', version() as server_version;"
... (negative result, fails with 'No known driver to handle "jdbc:sc://localhost:15002"')
$ SPARK_CONNECT_BEELINE=1 bin/beeline -u jdbc:sc://localhost:15002 -e "select 'Hello, Spark Connect!', version() as server_version;"
... (positive result)
$ bin/beeline -u jdbc:hive2://localhost:10000 -e "select 'Hello, Spark Connect!', version() as server_version;"
... (positive result)
$ SPARK_CONNECT_BEELINE=1 bin/beeline -u jdbc:hive2://localhost:10000 -e "select 'Hello, Spark Connect!', version() as server_version;"
... (positive result)
Spark connect distribution
$ tar -xzf spark-4.1.0-SNAPSHOT-bin-SPARK-54002-connect.tgz
$ cd spark-4.1.0-SNAPSHOT-bin-SPARK-54002-connect
$ bin/beeline -u jdbc:sc://localhost:15002 -e "select 'Hello, Spark Connect!', version() as server_version;"
... (positive result)
$ bin/beeline -u jdbc:hive2://localhost:10000 -e "select 'Hello, Spark Connect!', version() as server_version;"
... (positive result)

Was this patch authored or co-authored using generative AI tooling?

No.

@pan3793 pan3793 marked this pull request as draft October 23, 2025 12:36
@pan3793 pan3793 marked this pull request as ready for review October 23, 2025 13:06

pan3793 commented Oct 23, 2025

This feature requires SPARK-53934 (#52705) to work with the Connect Server; reviewers who want to verify this locally should apply that patch manually or wait for that patch to land.

SPARK-53934 has landed on master.


pan3793 commented Oct 23, 2025

cc @LuciferYang @HyukjinKwon

dongjoon-hyun pushed a commit that referenced this pull request Oct 23, 2025
### What changes were proposed in this pull request?

This is the initial implementation of the Connect JDBC driver. In detail, this PR implements the essential JDBC interfaces listed below.

- `java.sql.Driver`
- `java.sql.Connection`
- `java.sql.Statement`
- `java.sql.ResultSet`
- `java.sql.ResultSetMetaData`
- `java.sql.DatabaseMetaData`

As a first step, this PR only supports `NullType`, `BooleanType`, `ByteType`, `ShortType`, `IntegerType`, `LongType`, `FloatType`, `DoubleType`, and `StringType`.
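
For readers unfamiliar with the driver, a minimal sketch of exercising these interfaces through the plain java.sql API might look like the following; it assumes the Connect JDBC driver jar is on the classpath and self-registers with DriverManager, and uses the jdbc:sc:// URL scheme shown elsewhere in this PR:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;

// Minimal sketch of a JDBC client against a Spark Connect server;
// assumes the driver jar is on the classpath and registers itself.
public class ConnectJdbcSketch {
  public static void main(String[] args) throws SQLException {
    try (Connection conn = DriverManager.getConnection("jdbc:sc://localhost:15002");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT 'Hello, Spark Connect!', version()")) {
      ResultSetMetaData md = rs.getMetaData();
      while (rs.next()) {
        for (int i = 1; i <= md.getColumnCount(); i++) {
          System.out.printf("%s = %s%n", md.getColumnLabel(i), rs.getString(i));
        }
      }
    }
  }
}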

### Why are the changes needed?

Basically, this implements the feature proposed in [SPIP: JDBC Driver for Spark Connect](https://issues.apache.org/jira/browse/SPARK-53484).

### Does this PR introduce _any_ user-facing change?

It's a new feature.

### How was this patch tested?

New UTs are added.

And I have also cross-verified the BeeLine cases with SPARK-54002 (#52706).

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #52705 from pan3793/SPARK-53934.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun dongjoon-hyun left a comment


IIUC, I believe this PR is a little beyond the scope of SPARK-53484: SPIP JDBC Driver for Spark Connect, because we didn't sign off on moving away from spark-sql-core_*.jar for this additional module.

This PR modifies the classpath for bin/beeline - excluding spark-sql-core_*.jar and adding jars/connect-repl/*.jar, making it the same as bin/spark-connect-shell.

Specifically, I have a worry about this part which is blindly pushing the users away from the classic code.

- if (isRemote) {
+ if (isRemote || isBeeLine) {

Please make beeline work with the existing classpath by default. The new code path should be applied additionally via configuration or environment variables, @pan3793.


pan3793 commented Oct 24, 2025

Please make beeline work with the existing classpath by default. The new code path should be applied additionally via configuration or environment variables.

@dongjoon-hyun In general, I agree with your concerns and the idea of having a switch for the classpath assembly, but I think we can have a different default behavior, for the following reasons:

  1. Technically, BeeLine does NOT use Spark classes.
    Spark integrates the vanilla Hive BeeLine without modification; the dependency list can be found on Maven Central. Excluding some classic Spark jars should NOT be risky.

  2. To avoid surprising users, we'd better make BeeLine work with the Connect Server out of the box, which means we should tune the classpath automatically.

  3. If we want to achieve both point 2 and making beeline work with the existing classpath by default, we must have a mechanism to distinguish which service BeeLine is going to connect to, which raises two issues:

    1. We can parse the args in SparkClassCommandBuilder to detect the connect service if the user provides the JDBC URL in the command directly, e.g., beeline -u 'jdbc:sc://xxxx', but this means we need to process BeeLine args in the Spark Launcher, which introduces additional complexity and is not desirable IMO (see the sketch after this list).
    2. BeeLine also allows users to use !connect <jdbc-url> to connect to a DBMS in interactive mode (after starting the CLI). In this case, we don't have a chance to dynamically change the classpath.
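
To illustrate why the detection in 3.1 only covers the direct-URL case, here is a hypothetical sketch of such a check in the launcher; the helper name and argument layout are assumptions, not proposed code:

// Hypothetical helper: scan BeeLine's command-line args for a Connect JDBC URL.
// It cannot see URLs entered later via the interactive !connect command (issue 3.2).
static boolean looksLikeConnectUrl(String[] beeLineArgs) {
  for (int i = 0; i < beeLineArgs.length - 1; i++) {
    if ("-u".equals(beeLineArgs[i]) && beeLineArgs[i + 1].startsWith("jdbc:sc://")) {
      return true;
    }
  }
  return false;
}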

Given the above reasons, I think we can change the classpath as proposed by this PR by default, and have an internal switch (i.e., env var SPARK_BEELINE_CLASSIC, kept at least until 5.x) as a backdoor to allow users to switch back to the original classpath if something goes wrong.

Or, if we are very conservative, we can provide a switch (e.g., env var SPARK_BEELINE_CONNECT) that the user must set explicitly before using BeeLine to connect to the Connect Server. TBH, I think this hurts the user experience:

$ SPARK_BEELINE_CONNECT=1 bin/beeline -u jdbc:sc://localhost:15002

or

$ SPARK_BEELINE_CONNECT=1 bin/beeline
beeline> !connect jdbc:sc://localhost:15002

also cc @LuciferYang

@dongjoon-hyun

Please note that Apache Spark's default distribution is the non-Spark-Connect version. The majority of users are not using Connect at all. So, there is nothing here that hurts the existing classic user experience.

Or, if we are very conservative, we can provide a switch (e.g., env var SPARK_BEELINE_CONNECT) that the user must set explicitly before using BeeLine to connect to the Connect Server. TBH, I think this hurts the user experience.

As I mentioned, the only point on which we disagree is the following.

Given the above reasons, I think we can change the classpath as proposed by this PR by default,

If this PR contains any removal of classic Spark behavior, I need to cast a veto here, unfortunately.

@pan3793 pan3793 marked this pull request as draft October 27, 2025 02:34
if (isRemote && "1".equals(getenv("SPARK_SCALA_SHELL")) && project.equals("sql/core")) {
  continue;
}
if (isBeeLine && project.equals("sql/core")) {

Is it possible to detect that BeeLine is being used with the Connect JDBC driver and apply special handling solely for that specific scenario?


@LuciferYang short answer - it's possible only in some cases, details were explained at #52706 (comment)

@github-actions github-actions bot added the BUILD label Oct 27, 2025
@pan3793 pan3793 marked this pull request as ready for review October 27, 2025 12:08

pan3793 commented Oct 27, 2025

@dongjoon-hyun @LuciferYang I have updated the PR to keep the existing logic by default. Now, for the Spark classic (default) distribution, the proposed classpath changes to bin/beeline only take effect when SPARK_CONNECT_BEELINE=1 is set explicitly. For convenience, this is enabled by default for the Spark connect distribution.
