[SPARK-53399][SQL] Merge Python UDFs #52238

peter-toth · 2025-09-04T16:06:34Z

What changes were proposed in this pull request?

Recent improvements to the CollapseProject rule (like #33958) prevent duplicating expensive expressions, which brings a considerable performance improvement in many cases.
But there is one particular case where they can introduce significant performance degradation. Consider a query where adjacent project nodes don't get collapsed because they contain expensive expressions that are referenced multiple times, but the nodes also contain Python UDF expressions that otherwise wouldn't prevent collapsing the project nodes. E.g.:

Project a + a as a_plus_a, PythonUDF(...) as udf2, udf1
  Project <expensive calculation> as a, PythonUDF(...) as udf1
    ...

In the above example CollapseProject doesn't modify the 2 project nodes, which then causes 2 BatchEvalPython nodes to appear in the plan when ExtractPythonUDFs extracts them:

Project a + a as a_plus_a, udf2, udf1
  BatchEvalPython PythonUDF(...) -> udf2
    Project <expensive calculation> as a, udf1
      BatchEvalPython PythonUDF(...) -> udf1
        ...

The 2 BatchEvalPython nodes can cause significant serialization/deserialization overhead compared to the case when the original project nodes were collapsed and we had only 1 BatchEvalPython node.
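
The effect described above can be sketched with a small toy model. This is not Spark code: `Expr`, `collapse`, and `count_batch_eval_nodes` are hypothetical names invented for illustration, and the model assumes that every Project holding at least one Python UDF gets exactly one BatchEvalPython node beneath it.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Expr:
    """Toy stand-in for a named expression in a Project list (hypothetical)."""
    name: str
    is_python_udf: bool = False


def count_batch_eval_nodes(projects: List[List[Expr]]) -> int:
    # Toy assumption: one BatchEvalPython node per Project that holds a Python UDF.
    return sum(1 for p in projects if any(e.is_python_udf for e in p))


def collapse(upper: List[Expr], lower: List[Expr]) -> List[Expr]:
    # Replace each bare reference in the upper list with its definition from the
    # lower list; expressions defined only in the upper list are kept unchanged.
    lower_by_name = {e.name: e for e in lower}
    return [lower_by_name.get(e.name, e) for e in upper]


# The two adjacent Projects from the example, upper one first.
upper = [Expr("a_plus_a"), Expr("udf2", is_python_udf=True), Expr("udf1")]
lower = [Expr("a"), Expr("udf1", is_python_udf=True)]

nodes_without_collapse = count_batch_eval_nodes([upper, lower])
nodes_with_collapse = count_batch_eval_nodes([collapse(upper, lower)])
```

In this model the uncollapsed plan needs two eval nodes and the collapsed plan needs one, which matches the plan shapes shown above.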

The old behaviour can be restored by setting spark.sql.optimizer.collapseProjectAlwaysInline=true, but that is still not ideal because we lose the performance improvement in other cases.

This PR improves the CollapseProject rule to force merging Python UDFs in project groups (multiple adjacent project nodes) when they can be executed in one run.

Why are the changes needed?

To fix a performance regression caused by the latest changes to CollapseProject.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New and existing UTs.

Was this patch authored or co-authored using generative AI tooling?

No.

peter-toth · 2025-09-04T16:18:38Z

cc @cloud-fan , @dongjoon-hyun

dongjoon-hyun · 2025-09-04T18:50:43Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/PythonUDF.scala

+
+  def correctEvalType(udf: PythonUDF): Int = {
+    if (udf.evalType == PythonEvalType.SQL_ARROW_BATCHED_UDF) {
+      if (conf.pythonUDFArrowFallbackOnUDT &&


Can we simply get the value of conf.pythonUDFArrowFallbackOnUDT as a parameter? It seems like too much to bring in SQLConfHelper just to use conf in this object. WDYT, @peter-toth ?

Fixed in 50664c5.

dongjoon-hyun · 2025-09-04T18:52:30Z

sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala

-      case Seq(child: PythonUDF) => correctEvalType(e) == correctEvalType(child) &&
-        shouldExtractUDFExpressionTree(child)
+      case Seq(child: PythonUDF) =>
+        PythonUDF.correctEvalType(e) == PythonUDF.correctEvalType(child) &&


Shall we simply import PythonUDF.correctEvalType method directly?

Sure, done in 50664c5.

cloud-fan · 2025-09-05T09:06:23Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/PythonUDF.scala

@@ -51,6 +51,27 @@ object PythonUDF {
    // support new types in the future, e.g, N -> N transform.
    e.isInstanceOf[PythonUDAF]
  }
+
+  def correctEvalType(udf: PythonUDF): Int = {


what does "correct" mean here?

nvm, the code was moved from somewhere else.

cloud-fan · 2025-09-05T09:17:03Z

Shall we add a new rule for merging Python UDFs? I think it's orthogonal to CollapseProject: The new rule will pull up the lower Python UDF to the upper Project, even if it duplicates expressions or makes the lower Project very wide. Then we leave it to CollapseProject to decide if we can completely merge the two Projects. I think it can produce a better plan as we only duplicate minimal expressions to make Python UDFs live in the same Project.

peter-toth · 2025-09-05T09:28:19Z

Shall we add a new rule for merging Python UDFs? I think it's orthogonal to CollapseProject: The new rule will pull up the lower Python UDF to the upper Project, even if it duplicates expressions or makes the lower Project very wide. Then we leave it to CollapseProject to decide if we can completely merge the two Projects. I think it can produce a better plan as we only duplicate minimal expressions to make Python UDFs live in the same Project.

I want to revisit #52149 a bit later so that the rule doesn't just choose between collapsing and not collapsing, but can merge certain expressions into the upper Project and keep others in the lower one. I think once we do that in CollapseProject we won't need 2 separate rules (traversals of the plan).
Do you think we can merge this PR to fix the regression, and I can work on an improvement next week?

cloud-fan · 2025-09-08T08:51:52Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

      case p1 @ Project(_, p2: Project)
-          if canCollapseExpressions(p1.projectList, p2.projectList, alwaysInline) =>
+          if canCollapseExpressions(


I'm OK to special-case Python UDF in CollapseProject for now and refactor it later, but we should still make it efficient.

Can we use the idea from #52149 to not completely merge two Projects for Python UDF? We can pull up the Python UDF from the lower Project to the upper one so that they live in the same Project. I.e. we add extra pattern matches after case ... if canCollapseExpressions so that if we can't fully collapse two Projects, we check if we can partially merge them for Python UDF.

I added the merge idea, but wanted to handle the case fully in mergeProjectExpressions().
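
The partial-merge idea discussed in this thread can be sketched as a toy model. This is not the PR's mergeProjectExpressions() implementation: `Expr` and `pull_up_python_udfs` are hypothetical names, and the sketch assumes the upper Project refers to a lower Python UDF only as a bare column reference.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Expr:
    """Toy stand-in for a named expression (hypothetical, not a Spark class)."""
    name: str                       # output column name
    is_python_udf: bool = False
    inputs: Tuple[str, ...] = ()    # columns this expression reads


def pull_up_python_udfs(
    upper: List[Expr], lower: List[Expr]
) -> Tuple[List[Expr], List[Expr]]:
    # Move Python UDF definitions from the lower list into the upper list and
    # leave every other (possibly expensive) expression where it is.
    lower_udfs = {e.name: e for e in lower if e.is_python_udf}
    new_upper = [lower_udfs.get(e.name, e) for e in upper]
    new_lower = [e for e in lower if not e.is_python_udf]

    # The lower list must now pass through the columns the moved UDFs read,
    # which is why it can become wider.
    produced = {e.name for e in new_lower}
    for udf in lower_udfs.values():
        for col in udf.inputs:
            if col not in produced:
                new_lower.append(Expr(col))
                produced.add(col)
    return new_upper, new_lower


upper = [
    Expr("a_plus_a", inputs=("a",)),
    Expr("udf2", is_python_udf=True, inputs=("a",)),
    Expr("udf1"),  # bare reference to the lower UDF
]
lower = [
    Expr("a", inputs=("y",)),  # stands in for the expensive calculation
    Expr("udf1", is_python_udf=True, inputs=("z",)),
]

merged_upper, merged_lower = pull_up_python_udfs(upper, lower)
```

After the call both UDFs sit in the upper list, while the expensive expression `a` is still computed once in the lower list, which now also passes `z` through.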

dongjoon-hyun · 2025-09-09T23:06:40Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

@@ -1319,7 +1447,7 @@ object CollapseProject extends Rule[LogicalPlan] with AliasHelper {

  def buildCleanedProjectList(
      upper: Seq[NamedExpression],
-      lower: Seq[NamedExpression]): Seq[NamedExpression] = {
+      lower: Iterable[NamedExpression]): Seq[NamedExpression] = {


Just a question. Where does this PR pass a non-Seq type to this method?

When we call it with a ListBuffer type mustInlines argument at: https://github.com/apache/spark/pull/52238/files#diff-11264d807efa58054cca2d220aae8fba644ee0f0f2a4722c46d52828394846efR1343

dongjoon-hyun · 2025-09-09T23:10:57Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/PythonUDF.scala

@@ -51,6 +51,27 @@ object PythonUDF {
    // support new types in the future, e.g, N -> N transform.
    e.isInstanceOf[PythonUDAF]
  }
+
+  def correctEvalType(udf: PythonUDF, pythonUDFArrowFallbackOnUDT: Boolean): Int = {


Thank you for avoiding SparkConf.

dongjoon-hyun · 2025-09-09T23:12:20Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+            alwaysInline,
+            newPythonUDFEvalTypesInUpperProjects,
+            pythonUDFArrowFallbackOnUDT)
+          && canCollapseAggregate(p, agg) =>


nit. maybe more indentation?

Yes, done in 5ef3869.

dongjoon-hyun · 2025-09-09T23:20:36Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+
+  private def cheapToInlineProducer(
+      producer: NamedExpression,
+      relatedConsumers: Iterable[Expression]) = trimAliases(producer) match {


Can we use Seq[Expression] here?

I can restore the Seq[] type, but here: https://github.com/apache/spark/pull/52238/files#diff-11264d807efa58054cca2d220aae8fba644ee0f0f2a4722c46d52828394846efR1323 we pass in an ExpressionSet type relatedConsumers argument and all we do in cheapToInlineProducer() is to iterate on relatedConsumers.

dongjoon-hyun · 2025-09-10T00:30:05Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+    val pythonUDFArrowFallbackOnUDT = conf.pythonUDFArrowFallbackOnUDT
+
+    traverse(plan, alwaysInline, Set.empty, pythonUDFArrowFallbackOnUDT)
+  }


Shall we simplify a little more like the following?

def apply(plan: LogicalPlan, alwaysInline: Boolean): LogicalPlan = {
  traverse(plan, alwaysInline, Set.empty, conf.pythonUDFArrowFallbackOnUDT)
}

Fixed in e2db127.

cloud-fan · 2025-09-10T17:28:02Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+    val neverInlines = ListBuffer.empty[NamedExpression]
+    val mustInlines = ListBuffer.empty[NamedExpression]
+    val maybeInlines = ListBuffer.empty[NamedExpression]
+    val others = ListBuffer.empty[NamedExpression]


This category is not explained in the comment above.

cloud-fan · 2025-09-10T17:32:40Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+          }
+        }
+
+      case o => others += o


this is always Attribute?

github-actions bot added SQL PYTHON labels Sep 4, 2025

peter-toth mentioned this pull request Sep 4, 2025

[SPARK-53399][SQL] Restore CollapseProject to merge unrelated expressions #52149

Draft

peter-toth force-pushed the SPARK-53399-collapse-python-udfs branch from c7eb6cb to 4814344 Compare September 4, 2025 17:36

dongjoon-hyun reviewed Sep 4, 2025

zhengruifeng requested review from ueshin and HyukjinKwon September 5, 2025 02:26

cloud-fan reviewed Sep 5, 2025

cloud-fan reviewed Sep 8, 2025

[SPARK-53399][SQL] Merge Python UDFs

ebff22c

peter-toth force-pushed the SPARK-53399-collapse-python-udfs branch from 50664c5 to ebff22c Compare September 9, 2025 22:54

dongjoon-hyun reviewed Sep 9, 2025

dongjoon-hyun reviewed Sep 10, 2025

peter-toth changed the title from [SPARK-53399][SQL] Collapse Python UDFs to [SPARK-53399][SQL] Merge Python UDFs Sep 10, 2025

peter-toth added 2 commits September 9, 2025 21:04

fix review findings

e2db127

add more comments

30f3be9

peter-toth force-pushed the SPARK-53399-collapse-python-udfs branch from 61632fe to 30f3be9 Compare September 10, 2025 04:37

peter-toth added 2 commits September 9, 2025 21:49

fix indent

5ef3869

fix comment

aa86729

dongjoon-hyun approved these changes Sep 10, 2025

cloud-fan reviewed Sep 10, 2025

cloud-fan approved these changes Sep 10, 2025

cloud-fan reviewed Sep 10, 2025
