-
Notifications
You must be signed in to change notification settings - Fork 28.5k
[SPARK-51884][SQL]Part 1.a Add outer scope attributes for SubqueryExpression #50822
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-51884][SQL]Part 1.a Add outer scope attributes for SubqueryExpression #50822
Conversation
cc: @agubichev @cloud-fan for review |
@@ -2326,7 +2326,7 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor | |||
private def resolveSubQuery( | |||
e: SubqueryExpression, | |||
outer: LogicalPlan)( | |||
f: (LogicalPlan, Seq[Expression]) => SubqueryExpression): SubqueryExpression = { | |||
f: (LogicalPlan, Seq[(Expression, Boolean)]) => SubqueryExpression): SubqueryExpression = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add a comment here to explain what this boolean means?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a separate class to store this would be even better
@@ -66,7 +66,8 @@ abstract class PlanExpression[T <: QueryPlan[_]] extends Expression { | |||
* A base interface for expressions that contain a [[LogicalPlan]]. | |||
* | |||
* @param plan: the subquery plan | |||
* @param outerAttrs: the outer references in the subquery plan | |||
* @param outerAttrs: the outer references in the subquery plan and a boolean flag marking whether | |||
* the outer reference can be resolved in its immediate parent plan or not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* the outer reference can be resolved in its immediate parent plan or not. | |
* the outer reference can be resolved in its immediate outer plan or not. |
@@ -75,18 +76,35 @@ abstract class PlanExpression[T <: QueryPlan[_]] extends Expression { | |||
*/ | |||
abstract class SubqueryExpression( | |||
plan: LogicalPlan, | |||
outerAttrs: Seq[Expression], | |||
outerAttrs: Seq[(Expression, Boolean)], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's say the plan is
OuterPlan1
OuterPlan2
SubqueryExpression(attr1, attr2)
If attr1
is from OuterPlan1
, then it should be OuterReference
. If attr2
is from OuterPlan2
, it should be AttributeReference
. The rationale is: SubqueryExpression
lives in the OuterPlan2
, so it can directly use AttributeReference
to reference columns (but not in SubqueryExpression#plan
).
Seems we don't need the boolean flag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The decorrelation framework needs to distinguish between normal outerAttrs and outerScopeAttrs because we have some extra operations for outerScopeAttrs.
For this case, both attr2 and att1 should be OuterReference as attr2 can't be resolved in the SubqueryExpression.plan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you mean that attr2 can be AttributeReference and attr1 can be OuterReference(attr1), it's not that clear comparing to boolean flags. And later in the optimizer, we have stripOuterReference process for the SubqueryExpression and also we have checks to see if a expression/plan have OuterReference to decide whether the subquery plan is correlated but not processed or not. Having additional OuterWrapper in the SubqueryExpression is very misleading.
@@ -590,31 +638,40 @@ case class ListQuery( | |||
*/ | |||
case class Exists( | |||
plan: LogicalPlan, | |||
outerAttrs: Seq[Expression] = Seq.empty, | |||
outerAttrs: Seq[(Expression, Boolean)] = Seq.empty, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have a case class with a scala doc instead of this tuple?
@@ -78,28 +80,37 @@ case class FunctionTableSubqueryArgumentExpression( | |||
"WITH SINGLE PARTITION is mutually exclusive with PARTITION BY") | |||
|
|||
override def dataType: DataType = plan.schema | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we revert these new-line diffs?
override lazy val canonicalized: Expression = { | ||
FunctionTableSubqueryArgumentExpression( | ||
plan.canonicalized, | ||
outerAttrs.map(_.canonicalized), | ||
outerAttrs.map {case (expr, b) => (expr.canonicalized, b)}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not use b
, but give it a meaningful name
closed as through offline discussion, we decided to choose this design: #50838 |
What changes were proposed in this pull request?
Why are the changes needed?
Spark only supports one layer of correlation now and does not support nested correlation.
For example,
is supported and
is not supported.
The reason spark does not support it is because the Analyzer and Optimizer resolves and plans Subquery in a recursive way.
The definition change for the SubqueryExpression adds the metadata OuterScopeAttrs which helps later rewrites for the Analyzer and Optimizer.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Current UT and Suite
Was this patch authored or co-authored using generative AI tooling?
No