[SPARK-49743][SQL] OptimizeCsvJsonExpr should not change schema fields when pruning GetArrayStructFields #48190

nikhilsheoran-db · 2024-09-20T19:19:21Z

What changes were proposed in this pull request?

When pruning the struct schema under GetArrayStructFields, derive the pruned schema from the existing StructType instead of building it from the accessed field.
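
To illustrate the difference, here is a minimal sketch that uses hypothetical stand-in types (`Field`, `Struct`) and hypothetical helpers (`pruneByAccessedField`, `pruneByOrdinal`) rather than Spark's real `StructField`/`StructType` classes or the actual rule code. It assumes the accessed field carries the name as written in the query (`A`), while the declared schema field is named `a`.

```scala
// Hypothetical stand-ins for Spark's StructField / StructType; not the real classes.
final case class Field(name: String, dataType: String)
final case class Struct(fields: Vector[Field])

object PruneSketch {
  // Old behaviour (sketch): build the pruned struct from the accessed field,
  // which carries the name as written in the query (e.g. "A").
  def pruneByAccessedField(accessed: Field): Struct = Struct(Vector(accessed))

  // New behaviour (sketch): look the field up by ordinal in the existing struct,
  // so the declared name (e.g. "a") is preserved.
  def pruneByOrdinal(schema: Struct, ordinal: Int): Struct =
    Struct(Vector(schema.fields(ordinal)))

  // Declared element schema: struct<a: INT, b: INT>
  val declared: Struct = Struct(Vector(Field("a", "INT"), Field("b", "INT")))

  // Assumption: the accessed field takes its name from the query text `.A`.
  val accessed: Field = Field("A", "INT")

  val before: Struct = pruneByAccessedField(accessed) // field named "A"
  val after: Struct = pruneByOrdinal(declared, 0)     // field named "a"
}
```

In this sketch both variants prune down to a single field; only the ordinal-based variant keeps the name that the schema declared.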

Why are the changes needed?

Fixes a bug in the OptimizeCsvJsonExprs rule that changed the field names of the underlying struct being extracted.
This surfaced as a correctness issue: instead of returning the right values for a field, the query returned null.
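
As a rough sketch of why a renamed schema field leads to null output, the snippet below assumes (as the null results in this PR suggest) that JSON keys are matched against schema field names exactly, including case. The `record` map and the `read` helper are hypothetical and only model that lookup; they are not Spark's parser.

```scala
object NullOutputSketch {
  // One parsed JSON record, e.g. {"a": 1, "b": 2}; keys are as written in the data.
  val record: Map[String, Int] = Map("a" -> 1, "b" -> 2)

  // Hypothetical reader: matches a schema field name against the record keys
  // exactly (case-sensitively) and yields None (i.e. null) when no key matches.
  def read(fieldName: String): Option[Int] = record.get(fieldName)

  val withDeclaredName: Option[Int] = read("a") // Some(1)
  val withRenamedField: Option[Int] = read("A") // None
}
```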

Does this PR introduce any user-facing change?

Yes. The output changes for queries of the following form:

SELECT
  from_json('[{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').a,
  from_json('[{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').A
FROM
  range(3) as t

Earlier, the result would have been:

Array([ArraySeq(0),ArraySeq(null)], [ArraySeq(1),ArraySeq(null)], [ArraySeq(2),ArraySeq(null)])

The new result (verified through spark-shell) is:

Array([ArraySeq(0),ArraySeq(0)], [ArraySeq(1),ArraySeq(1)], [ArraySeq(2),ArraySeq(2)])

How was this patch tested?

Added unit tests.
Without this change, the added test fails because the rule changes the schema field name from a to A:

- SPARK-49743: prune unnecessary columns from GetArrayStructFields does not change schema *** FAILED ***
  == FAIL: Plans do not match ===
  !Project [from_json(ArrayType(StructType(StructField(A,IntegerType,true)),true), json#0, Some(America/Los_Angeles)).A AS a#0]   Project [from_json(ArrayType(StructType(StructField(a,IntegerType,true)),true), json#0, Some(America/Los_Angeles)).A AS a#0]
   +- LocalRelation <empty>, [json#0]                                                                                             +- LocalRelation <empty>, [json#0] (PlanTest.scala:179)

Was this patch authored or co-authored using generative AI tooling?

No.

nikhilsheoran-db · 2024-09-20T23:28:32Z

cc: @cloud-fan @dbatomic to take a look.

HyukjinKwon · 2024-09-21T03:23:52Z

cc @viirya

nikhilsheoran-db added 2 commits September 20, 2024 12:06

Add test

b03be2e

Add fix

13a4df7

github-actions bot added the SQL label Sep 20, 2024

nikhilsheoran-db added 2 commits September 20, 2024 15:11

Fix build error

39f07da

Fix indentation

9ce3857
