Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-49743][SQL] OptimizeCsvJsonExpr should not change schema fields when pruning GetArrayStructFields #48190

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

nikhilsheoran-db
Copy link
Contributor

@nikhilsheoran-db nikhilsheoran-db commented Sep 20, 2024

What changes were proposed in this pull request?

  • When pruning the schema of the struct in GetArrayStructFields, rely on the existing StructType to obtain the pruned schema instead of using the accessed field.

Why are the changes needed?

  • Fixes a bug in OptimizeCsvJsonExprs rule that would have otherwise changed the schema fields of the underlying struct to be extracted.
  • This would show up as a correctness issue where for a field instead of picking the right values we would have ended up giving null output.

Does this PR introduce any user-facing change?

Yes. The query output would change for the queries of the following type:

SELECT
  from_json('[{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').a,
  from_json('[{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').A
FROM
  range(3) as t

Earlier, the result would had been:

Array([ArraySeq(0),ArraySeq(null)], [ArraySeq(1),ArraySeq(null)], [ArraySeq(2),ArraySeq(null)])

vs the new result is (verified through spark-shell):

Array([ArraySeq(0),ArraySeq(0)], [ArraySeq(1),ArraySeq(1)], [ArraySeq(2),ArraySeq(2)])

How was this patch tested?

  • Added unit tests.
  • Without this change, the added test would fail as we would have modified the schema from a to A:
- SPARK-49743: prune unnecessary columns from GetArrayStructFields does not change schema *** FAILED ***                                                                   
  == FAIL: Plans do not match ===                                                                                                                                          
  !Project [from_json(ArrayType(StructType(StructField(A,IntegerType,true)),true), json#0, Some(America/Los_Angeles)).A AS a#0]   Project [from_json(ArrayType(StructType(S
tructField(a,IntegerType,true)),true), json#0, Some(America/Los_Angeles)).A AS a#0]                                                                                        
   +- LocalRelation <empty>, [json#0]                                                                                             +- LocalRelation <empty>, [json#0] (PlanT
est.scala:179)

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Sep 20, 2024
@nikhilsheoran-db
Copy link
Contributor Author

cc: @cloud-fan @dbatomic to take a look.

@HyukjinKwon
Copy link
Member

cc @viirya

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants