[SPARK-52040][PYTHON][SQL][CONNECT] ResolveLateralColumnAliasReference should retain the plan id #50831


Closed
zhengruifeng wants to merge 3 commits

Conversation


@zhengruifeng commented May 8, 2025

What changes were proposed in this pull request?

ResolveLateralColumnAliasReference should retain the plan id

Why are the changes needed?

bug fix

before:

In [1]: from pyspark.sql import functions as sf

In [2]: df1 = spark.range(10).select((sf.col("id") + sf.lit(1)).alias("x"), (sf.col("x") + sf.lit(1)).alias("y"))

In [3]: df2 = spark.range(10).select(sf.col("id").alias("x"))

In [4]: df1.join(df2, df1.x == df2.x).select(df1.y)
Out[4]: 25/05/08 16:38:28 ERROR ErrorUtils: Spark Connect RPC error during: analyze. UserId: ruifeng.zheng. SessionId: af3deba7-1e48-49fd-adad-2046a72ed341.
org.apache.spark.sql.AnalysisException: [CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column "y". It's probably because of illegal references like `df1.select(df2.col("a"))`. SQLSTATE: 42704
	at org.apache.spark.sql.errors.QueryCompilationErrors$.cannotResolveDataFrameColumn(QueryCompilationErrors.scala:4147)
	at org.apache.spark.sql.catalyst.analysis.ColumnResolutionHelper.resolveDataFrameColumn(ColumnResolutionHelper.scala:562)
	at org.apache.spark.sql.catalyst.analysis.ColumnResolutionHelper.tryResolveDataFrameColumns(ColumnResolutionHelper.scala:537)

after:

In [1]: from pyspark.sql import functions as sf

In [2]: df1 = spark.range(10).select((sf.col("id") + sf.lit(1)).alias("x"), (sf.col("x") + sf.lit(1)).alias("y"))

In [3]: df2 = spark.range(10).select(sf.col("id").alias("x"))

In [4]: df1.join(df2, df1.x == df2.x).select(df1.y).show()
+---+
|  y|
+---+
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
| 10|
+---+
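For context on the failure mode: in Spark Connect, each client-side DataFrame operation tags its logical plan node with a plan id, and a reference like `df1.y` is later resolved by locating the plan node with the matching id. The toy sketch below is a simplified stand-in (not Spark's actual `Project` node or the `ResolveLateralColumnAliasReference` rule; the names and structures are invented for illustration) showing why a rule that rebuilds a node without copying its tags makes that lookup fail, and why copying the tags, as this PR does, restores resolution.

```python
# Toy model of plan-id based column resolution. All names here are simplified
# stand-ins for Spark's TreeNode tag mechanism, not the real implementation.

PLAN_ID_TAG = "plan_id"

class Project:
    """A projection node carrying a tag map (like Spark's TreeNode tags)."""
    def __init__(self, exprs, child=None, tags=None):
        self.exprs = exprs
        self.child = child
        self.tags = dict(tags or {})

def rewrite_buggy(plan):
    # Rebuilds the node to expand the lateral alias `x`, but drops the tags:
    # this mimics the pre-fix behavior of the rule.
    return Project([e.replace("x", "(id + 1)") for e in plan.exprs], plan.child)

def rewrite_fixed(plan):
    # Same rewrite, but the new node retains the original node's tags.
    out = Project([e.replace("x", "(id + 1)") for e in plan.exprs], plan.child)
    out.tags.update(plan.tags)  # the fix: retain the plan id
    return out

def find_by_plan_id(plan, plan_id):
    # Walks down the plan looking for the node tagged with `plan_id`,
    # analogous to how `df1.y` is resolved back to df1's plan.
    while plan is not None:
        if plan.tags.get(PLAN_ID_TAG) == plan_id:
            return plan
        plan = plan.child
    return None

# df1's select node, tagged with a plan id by the client.
df1_plan = Project(["x + 1 AS y"], tags={PLAN_ID_TAG: 42})

# Without the tags, the lookup fails (CANNOT_RESOLVE_DATAFRAME_COLUMN);
# with them, `df1.y` resolves.
assert find_by_plan_id(rewrite_buggy(df1_plan), 42) is None
assert find_by_plan_id(rewrite_fixed(df1_plan), 42) is not None
```

The sketch only models the tag bookkeeping; the actual rule also rewrites the lateral alias references themselves, as shown in the before/after sessions above.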

Does this PR introduce any user-facing change?

Yes, the query above works after this change.

How was this patch tested?

Added a test.

Was this patch authored or co-authored using generative AI tooling?

no

@zhengruifeng requested a review from cloud-fan May 8, 2025 08:42
@zhengruifeng changed the title [SPARK-52040][SQL][CONNECT] ResolveLateralColumnAliasReference should retain the plan id [SPARK-52040][PYTHON][SQL][CONNECT] ResolveLateralColumnAliasReference should retain the plan id May 8, 2025
@github-actions github-actions bot removed the CONNECT label May 8, 2025
@dongjoon-hyun left a comment


Could you resolve the conflicts, @zhengruifeng ?

@dongjoon-hyun left a comment


+1, LGTM.

@xinrong-meng

SparkSessionE2ESuite and AdaptiveQueryExecSuite failed; would you rerun the tests?

@xinrong-meng

LGTM, thank you!

@cloud-fan

cloud-fan commented May 9, 2025

The k8s failure is unrelated, thanks, merging to master/4.0!

@cloud-fan closed this in 688281a May 9, 2025
cloud-fan pushed a commit that referenced this pull request May 9, 2025

[SPARK-52040][PYTHON][SQL][CONNECT] ResolveLateralColumnAliasReference should retain the plan id

Closes #50831 from zhengruifeng/fix_lca.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 688281a)
Signed-off-by: Wenchen Fan <[email protected]>
@zhengruifeng deleted the fix_lca branch May 9, 2025 01:20