[SPARK] Implement Correct fit() and transform() in SparkKMeansOperator #508

Manas-Dikshit · 2025-02-27T17:44:38Z

This PR fixes the implementation of fit() and transform() in SparkKMeansOperator.java so that the operator's use of Apache Spark MLlib's KMeans clustering is correct.

Changes
✅ Properly converts JavaRDD<double[]> to a Dataset using convertToDataFrame().
✅ Uses the correct features and prediction column names.
✅ Ensures transform() outputs a Tuple2<double[], Integer>, pairing each input point with its cluster label.
✅ Implements a predict() method that returns only the cluster labels.
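
For readers who want to see the shape of the contract described above, here is a minimal plain-Java sketch. It is not the code from this PR and does not use Spark or Wayang: the class name KMeansContractSketch is hypothetical, Map.Entry<double[], Integer> stands in for Tuple2<double[], Integer>, and a small deterministic Lloyd's loop (first k points as initial centers) stands in for MLlib's KMeans. It only illustrates that transform() returns each point with its label while predict() returns the labels alone.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the fit()/transform()/predict() contract; no Spark involved.
final class KMeansContractSketch {

    // Stand-in for the fitted model: only the cluster centers are kept.
    private final double[][] centers;

    private KMeansContractSketch(double[][] centers) {
        this.centers = centers;
    }

    // fit(): Lloyd's algorithm, seeded with the first k points (assumes points.size() >= k).
    static KMeansContractSketch fit(List<double[]> points, int k, int iterations) {
        int dim = points.get(0).length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = points.get(c).clone();
        }
        for (int it = 0; it < iterations; it++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {
                int c = nearest(centers, p);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                for (int d = 0; d < dim; d++) {
                    centers[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return new KMeansContractSketch(centers);
    }

    // transform(): pair every input point with the index of its nearest center.
    List<Map.Entry<double[], Integer>> transform(List<double[]> points) {
        List<Map.Entry<double[], Integer>> out = new ArrayList<>();
        for (double[] p : points) {
            out.add(new AbstractMap.SimpleEntry<>(p, nearest(centers, p)));
        }
        return out;
    }

    // predict(): the same assignment as transform(), but labels only.
    List<Integer> predict(List<double[]> points) {
        List<Integer> labels = new ArrayList<>();
        for (Map.Entry<double[], Integer> e : transform(points)) {
            labels.add(e.getValue());
        }
        return labels;
    }

    // Index of the center with the smallest squared Euclidean distance to p.
    private static int nearest(double[][] centers, double[] p) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0;
            for (int d = 0; d < p.length; d++) {
                double diff = p[d] - centers[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
                new double[]{0, 0}, new double[]{0, 1},
                new double[]{10, 10}, new double[]{10, 11});
        KMeansContractSketch model = fit(points, 2, 10);
        System.out.println(model.predict(points)); // prints [0, 0, 1, 1]
    }
}
```

Keeping predict() as a thin projection over transform() means the two cannot disagree about a point's cluster; whether the operator in this PR is structured that way is not something this sketch claims.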

Issue Fixed
🔧 Fixes Issue #364: "Support Fit and Transform in SparkKMeansOperator"

Testing
✔ Verified clustering correctness using sample RDD<double[]> input.
✔ Checked that cluster centers and predictions are accurately extracted.

zkaoudi

Thank you @Manas-Dikshit

Update SparkKMeansOperator.java

5509aac

2pk03 requested a review from zkaoudi February 27, 2025 17:52

zkaoudi approved these changes Feb 27, 2025

zkaoudi merged commit 1d5736f into apache:main Feb 27, 2025
4 checks passed
