[SPARK-49907] Support spark.ml on Connect #48791

Draft · wants to merge 2 commits into base: master
Conversation

wbo4958 (Contributor) commented Nov 7, 2024

What changes were proposed in this pull request?

This PR adds support for running spark.ml on Spark Connect.

Why are the changes needed?

It's a new feature that makes spark.ml work in the Spark Connect environment.

Does this PR introduce any user-facing change?

Yes, new feature.

How was this patch tested?

The manual test below runs without raising any exception.

(pyspark) user@bobby:~ $ pyspark --remote sc://localhost
Python 3.11.10 (main, Oct  3 2024, 07:29:13) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 4.0.0.dev0
      /_/

Using Python version 3.11.10 (main, Oct  3 2024 07:29:13)
Client connected to the Spark Connect server at localhost
SparkSession available as 'spark'.
>>> from pyspark.ml.classification import (LogisticRegression,
...                                        LogisticRegressionModel)
>>> from pyspark.ml.linalg import Vectors
>>> 
>>> df = spark.createDataFrame([
...     (Vectors.dense([1.0, 2.0]), 1),
...     (Vectors.dense([2.0, -1.0]), 1),
...     (Vectors.dense([-3.0, -2.0]), 0),
...     (Vectors.dense([-1.0, -2.0]), 0),
... ], schema=['features', 'label'])
>>> lr = LogisticRegression()
>>> lr.setMaxIter(30)
LogisticRegression_a842693fc5e7
>>> model: LogisticRegressionModel = lr.fit(df)

>>> model.predictRaw(Vectors.dense([1.0, 2.0]))
DenseVector([-21.1048, 21.1048])
>>> assert model.getMaxIter() == 30
>>> model.summary.roc.show()
+---+---+
|FPR|TPR|
+---+---+
|0.0|0.0|
|0.0|0.5|
|0.0|1.0|
|0.5|1.0|
|1.0|1.0|
|1.0|1.0|
+---+---+

>>> model.summary.weightedRecall
1.0
>>> model.summary.recallByLabel
[1.0, 1.0]
>>> model.coefficients
DenseVector([10.3964, 4.513])
>>> model.intercept
1.6823489096339976
>>> model.transform(df).show()
+-----------+-----+--------------------+--------------------+----------+
|   features|label|       rawPrediction|         probability|prediction|
+-----------+-----+--------------------+--------------------+----------+
|  [1.0,2.0]|    1|[-21.104818251026...|[6.82800596288997...|       1.0|
| [2.0,-1.0]|    1|[-17.962094978515...|[1.58183529116629...|       1.0|
|[-3.0,-2.0]|    0|[38.5329050234205...|           [1.0,0.0]|       0.0|
|[-1.0,-2.0]|    0|[17.7401204317582...|[0.99999998025016...|       0.0|
+-----------+-----+--------------------+--------------------+----------+

>>> model.write().overwrite().save("/tmp/connect-ml-demo")
>>> loaded_model = LogisticRegressionModel.load("/tmp/connect-ml-demo")
>>> assert loaded_model.getMaxIter() == 30
>>> loaded_model.transform(df).show()
+-----------+-----+--------------------+--------------------+----------+
|   features|label|       rawPrediction|         probability|prediction|
+-----------+-----+--------------------+--------------------+----------+
|  [1.0,2.0]|    1|[-21.104818251026...|[6.82800596288997...|       1.0|
| [2.0,-1.0]|    1|[-17.962094978515...|[1.58183529116629...|       1.0|
|[-3.0,-2.0]|    0|[38.5329050234205...|           [1.0,0.0]|       0.0|
|[-1.0,-2.0]|    0|[17.7401204317582...|[0.99999998025016...|       0.0|
+-----------+-----+--------------------+--------------------+----------+

Was this patch authored or co-authored using generative AI tooling?

No

@michTalebzadeh commented

You point out:

> It's a new feature that makes spark.ml work in the Spark Connect environment.

Can you please explain why this feature, which already exists in Spark itself, will be beneficial on Connect?

@hvanhovell (Contributor) commented

@michTalebzadeh the idea is to give Spark Connect users the same functionality as existing classic users.
