Potential incompatibility of graphar-pyspark with SparkConnect #366
SemyonSinchenko
started this conversation in
General
Replies: 2 comments 5 replies
-
Hi Sem, it seems that Spark Connect is the default in Databricks Runtime. Will it also become the default in Apache Spark? If so, libraries that rely on py4j-based Python bindings would break; that seems like a huge change for Apache Spark.
-
cc/ @lixueclaire, do you have any comments or foresight about the proposal?
-
Starting with version 3.4, Apache Spark introduced Spark Connect, a completely different way for Python to interact with the Scala side. It looks like the recent Databricks Runtime release (14.3 LTS) makes Spark Connect the default way of working, and people are already hitting issues in the Python bindings of Microsoft SynapseML: microsoft/SynapseML#2167
I guess that all libraries that provide a Scala core with Python bindings (via py4j) will face similar issues soon.
SynapseML is a well-known Spark extension, and it was one of my main inspirations when I worked on graphar-pyspark.
Today, while GraphAr is still so young, we have a chance to change everything. So I see rewriting graphar-pyspark from py4j to pure PySpark as an option to make it work with Spark Connect. I can do it, but I want to discuss it first. Maybe I'm missing something important.
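To make the proposal concrete, here is a rough sketch of the two binding styles. All names (`write_graph_py4j`, `GraphWriter`, the package path) are hypothetical and not the real graphar-pyspark API; the point is only the shape of the code, so nothing here needs a live cluster:

```python
# Style 1: py4j bridge -- reaches into the driver JVM through the gateway.
# This is the pattern that breaks under Spark Connect, because the client
# process has no JVM handle (spark._jvm does not exist there).
def write_graph_py4j(spark, df, path):
    # org.example.graphar.GraphWriter is a made-up Scala class for illustration
    jvm_writer = spark._jvm.org.example.graphar.GraphWriter()
    jvm_writer.write(df._jdf, path)  # df._jdf is also a py4j-only handle

# Style 2: pure PySpark -- only public DataFrame APIs. Spark Connect
# serializes these calls to the server, so the same code works in both
# classic and Connect modes.
def write_graph_pyspark(df, path):
    (df.write
       .format("parquet")   # GraphAr payloads are columnar/text formats anyway
       .mode("overwrite")
       .save(path))
```

The rewrite would essentially mean replacing every call of the first kind with an equivalent of the second kind.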
Information from the Databricks Runtime 14 release notes:
The most important point for us is that `_jvm` is no longer available.
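A minimal sketch of what that means in practice: a library can guard its py4j-backed code paths so they fail with a clear message instead of an `AttributeError`. The helper name and the two session stand-ins below are mocks for illustration, not real pyspark classes:

```python
def require_jvm(spark):
    """Return the py4j JVM view, or raise a helpful error when it is absent
    (e.g. under Spark Connect, where the client has no driver JVM handle)."""
    jvm = getattr(spark, "_jvm", None)
    if jvm is None:
        raise RuntimeError(
            "This API needs py4j (spark._jvm), which Spark Connect does not "
            "expose; a pure-PySpark code path is required instead."
        )
    return jvm

# Tiny stand-ins so both cases can be shown without a live cluster:
class ClassicSession:      # classic session exposes a _jvm attribute
    _jvm = object()

class ConnectSession:      # Connect-style session has no _jvm at all
    pass
```

With a guard like this, the py4j path keeps working on classic sessions, while Connect users get an actionable error instead of a crash deep inside the bindings.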