Skip to content

[SPARK-51884][SQL]Part 1.a Add outer scope attributes for SubqueryExpression #50822

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

AveryQi115
Copy link
Contributor

What changes were proposed in this pull request?

  • Add OuterScopeAttrs and related getter and setter methods for SubqueryExpression
  • Change outerAttrs from Seq[Expression] to Seq[(Expression, Boolean)] and change related methods
  • Update the usage of SubqueryExpression and classes extending SubqueryExpression

Why are the changes needed?

Spark only supports one layer of correlation now and does not support nested correlation.
For example,

SELECT col1 FROM VALUES (1, 2) t1 (col1, col2) WHERE EXISTS (
 SELECT col1 FROM VALUES (1, 2) t2 (col1, col2) WHERE t2.col2 == MAX(t1.col2)
)GROUP BY col1;

is supported and

SELECT col1 FROM VALUES (1, 2) t1 (col1, col2) WHERE EXISTS (
 SELECT col1 FROM VALUES (1, 2) t2 (col1, col2) WHERE t2.col2 == (
   SELECT MAX(t1.col2)
 )
)GROUP BY col1;

is not supported.

The reason spark does not support it is because the Analyzer and Optimizer resolves and plans Subquery in a recursive way.

The definition change for the SubqueryExpression adds the metadata OuterScopeAttrs which helps later rewrites for the Analyzer and Optimizer.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Current UT and Suite

Was this patch authored or co-authored using generative AI tooling?

No

@AveryQi115
Copy link
Contributor Author

cc: @agubichev @cloud-fan for review

@@ -2326,7 +2326,7 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
private def resolveSubQuery(
e: SubqueryExpression,
outer: LogicalPlan)(
f: (LogicalPlan, Seq[Expression]) => SubqueryExpression): SubqueryExpression = {
f: (LogicalPlan, Seq[(Expression, Boolean)]) => SubqueryExpression): SubqueryExpression = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a comment here to explain what this boolean means?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a separate class to store this would be even better

@@ -66,7 +66,8 @@ abstract class PlanExpression[T <: QueryPlan[_]] extends Expression {
* A base interface for expressions that contain a [[LogicalPlan]].
*
* @param plan: the subquery plan
* @param outerAttrs: the outer references in the subquery plan
* @param outerAttrs: the outer references in the subquery plan and a boolean flag marking whether
* the outer reference can be resolved in its immediate parent plan or not.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* the outer reference can be resolved in its immediate parent plan or not.
* the outer reference can be resolved in its immediate outer plan or not.

@@ -75,18 +76,35 @@ abstract class PlanExpression[T <: QueryPlan[_]] extends Expression {
*/
abstract class SubqueryExpression(
plan: LogicalPlan,
outerAttrs: Seq[Expression],
outerAttrs: Seq[(Expression, Boolean)],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's say the plan is

OuterPlan1
  OuterPlan2
    SubqueryExpression(attr1, attr2)

If attr1 is from OuterPlan1, then it should be OuterReference. If attr2 is from OuterPlan2, it should be AttributeReference. The rationale is: SubqueryExpression lives in the OuterPlan2, so it can directly use AttributeReference to reference columns (but not in SubqueryExpression#plan).

Seems we don't need the boolean flag?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The decorrelation framework needs to distinguish between normal outerAttrs and outerScopeAttrs because we have some extra operations for outerScopeAttrs.

For this case, both attr2 and att1 should be OuterReference as attr2 can't be resolved in the SubqueryExpression.plan.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you mean that attr2 can be AttributeReference and attr1 can be OuterReference(attr1), it's not that clear comparing to boolean flags. And later in the optimizer, we have stripOuterReference process for the SubqueryExpression and also we have checks to see if a expression/plan have OuterReference to decide whether the subquery plan is correlated but not processed or not. Having additional OuterWrapper in the SubqueryExpression is very misleading.

@@ -590,31 +638,40 @@ case class ListQuery(
*/
case class Exists(
plan: LogicalPlan,
outerAttrs: Seq[Expression] = Seq.empty,
outerAttrs: Seq[(Expression, Boolean)] = Seq.empty,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a case class with a scala doc instead of this tuple?

@@ -78,28 +80,37 @@ case class FunctionTableSubqueryArgumentExpression(
"WITH SINGLE PARTITION is mutually exclusive with PARTITION BY")

override def dataType: DataType = plan.schema

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we revert these new-line diffs?

override lazy val canonicalized: Expression = {
FunctionTableSubqueryArgumentExpression(
plan.canonicalized,
outerAttrs.map(_.canonicalized),
outerAttrs.map {case (expr, b) => (expr.canonicalized, b)},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not use b, but give it a meaningful name

@AveryQi115
Copy link
Contributor Author

closed as through offline discussion, we decided to choose this design: #50838

@AveryQi115 AveryQi115 closed this May 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants