[SPARK-49907] Support spark.ml on Connect #48791

Draft · wants to merge 2 commits into base: master
Conversation

wbo4958 (Contributor) commented Nov 7, 2024

What changes were proposed in this pull request?

This PR adds support for running spark.ml on Spark Connect.

Why are the changes needed?

It's a new feature that makes spark.ml work in the Spark Connect environment.

Does this PR introduce any user-facing change?

Yes, new feature.

How was this patch tested?

The manual test below runs without raising any exception.

(pyspark) user@bobby:~ $ pyspark --remote sc://localhost
Python 3.11.10 (main, Oct  3 2024, 07:29:13) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 4.0.0.dev0
      /_/

Using Python version 3.11.10 (main, Oct  3 2024 07:29:13)
Client connected to the Spark Connect server at localhost
SparkSession available as 'spark'.
>>> from pyspark.ml.classification import (LogisticRegression,
...                                        LogisticRegressionModel)
>>> from pyspark.ml.linalg import Vectors
>>> 
>>> df = spark.createDataFrame([
...     (Vectors.dense([1.0, 2.0]), 1),
...     (Vectors.dense([2.0, -1.0]), 1),
...     (Vectors.dense([-3.0, -2.0]), 0),
...     (Vectors.dense([-1.0, -2.0]), 0),
... ], schema=['features', 'label'])
>>> lr = LogisticRegression()
>>> lr.setMaxIter(30)
LogisticRegression_a842693fc5e7
>>> model: LogisticRegressionModel = lr.fit(df)

>>> model.predictRaw(Vectors.dense([1.0, 2.0]))
DenseVector([-21.1048, 21.1048])
>>> assert model.getMaxIter() == 30
>>> model.summary.roc.show()
+---+---+
|FPR|TPR|
+---+---+
|0.0|0.0|
|0.0|0.5|
|0.0|1.0|
|0.5|1.0|
|1.0|1.0|
|1.0|1.0|
+---+---+

>>> model.summary.weightedRecall
1.0
>>> model.summary.recallByLabel
[1.0, 1.0]
>>> model.coefficients
DenseVector([10.3964, 4.513])
>>> model.intercept
1.6823489096339976
>>> model.transform(df).show()
+-----------+-----+--------------------+--------------------+----------+
|   features|label|       rawPrediction|         probability|prediction|
+-----------+-----+--------------------+--------------------+----------+
|  [1.0,2.0]|    1|[-21.104818251026...|[6.82800596288997...|       1.0|
| [2.0,-1.0]|    1|[-17.962094978515...|[1.58183529116629...|       1.0|
|[-3.0,-2.0]|    0|[38.5329050234205...|           [1.0,0.0]|       0.0|
|[-1.0,-2.0]|    0|[17.7401204317582...|[0.99999998025016...|       0.0|
+-----------+-----+--------------------+--------------------+----------+

>>> model.write().overwrite().save("/tmp/connect-ml-demo")
>>> loaded_model = LogisticRegressionModel.load("/tmp/connect-ml-demo")
>>> assert loaded_model.getMaxIter() == 30
>>> loaded_model.transform(df).show()
+-----------+-----+--------------------+--------------------+----------+
|   features|label|       rawPrediction|         probability|prediction|
+-----------+-----+--------------------+--------------------+----------+
|  [1.0,2.0]|    1|[-21.104818251026...|[6.82800596288997...|       1.0|
| [2.0,-1.0]|    1|[-17.962094978515...|[1.58183529116629...|       1.0|
|[-3.0,-2.0]|    0|[38.5329050234205...|           [1.0,0.0]|       0.0|
|[-1.0,-2.0]|    0|[17.7401204317582...|[0.99999998025016...|       0.0|
+-----------+-----+--------------------+--------------------+----------+

Was this patch authored or co-authored using generative AI tooling?

No

@michTalebzadeh commented

You point out:

> It's a new feature that makes spark.ml work in the Spark Connect environment.

Can you please explain why this feature, which already exists in Spark itself, will be beneficial on Connect?

@hvanhovell (Contributor) commented

@michTalebzadeh the idea is to give Spark Connect users the same functionality as existing classic users.
