Skip to content

Conversation

zml1206
Copy link
Contributor

@zml1206 zml1206 commented Sep 4, 2025

What changes were proposed in this pull request?

Support disable code-gen for sort merge join other than FULL OUTER and Existence. In addition, adjust the default value of spark.sql.sortMergeJoinExec.buffer.in.memory.threshold to be consistent with the BUFFER_IN_MEMORY_THRESHOLD of other operators, which is 4096.

Why are the changes needed?

Avoid executor OOM when a single key matches too many rows in inner/left outer/ right outer sort merge join beccause of currentRows.
Before pr, we can only increase the executor memory. After pr, we can disable code-gen for sort merge join.

Does this PR introduce any user-facing change?

Yes, new configuration spark.sql.codegen.join.sortMergeJoin.enabled.

How was this patch tested?

Local test, before pr will OOM, after pr run successfully.

    withSQLConf(SQLConf.ENABLE_SORT_MERGE_JOIN_CODEGEN.key -> "false",
      SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
      val df1 = spark.range(1).map(_ => ("testkey", "testvalue1")).toDF("key", "value")
      val df2 = spark.range(50000000).map(_ => ("testkey", "testvalue2")).toDF("key", "value")
      df1.join(df2, Seq("key"), "left").show()
    }

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Sep 4, 2025
@zml1206 zml1206 changed the title [SPARK-53483] Support disable code-gen for sort merge join other than FULL OUTER and Existence [SPARK-53483][SQL] Support disable code-gen for sort merge join other than FULL OUTER and Existence Sep 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant