-
Notifications
You must be signed in to change notification settings - Fork 0
19683: Ensure null inputs to array setop functions return null output #171
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
WalkthroughThis pull request modifies null-handling behavior in DataFusion's array set operations (union, intersect, except). The changes update return type inference to handle null inputs more consistently, introduce null buffer union operations for tracking nullability across combined arrays, and refactor internal logic to replace per-element null scaffolding with early-continue approaches. Test expectations are updated to reflect changed behavior where certain null input scenarios now return null instead of empty lists. Documentation for array_union is corrected. Overall, approximately 112 lines are modified across implementation and test files to standardize null semantics in nested array operations. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Code Review: PR #171 - Ensure null inputs to array setop functions return null outputSummaryThis PR addresses issue apache#19683 by ensuring that array set operation functions ( Code Quality & Best Practices ✅Strengths:
Minor Improvements:
Potential Bugs & Issues ✅No critical bugs found. The implementation is sound:
One minor concern:
Performance Considerations ✅Positive changes:
No performance regressions identified:
Test Coverage ✅Excellent test coverage:
Test cases added:
Security Concerns ✅No security issues identified:
Additional ObservationsDocumentation:
Code organization:
RecommendationLGTM - Approve ✅ This is a well-implemented fix that:
The changes are focused, well-tested, and improve both correctness and code quality. |
🤖 Augment PR SummarySummary: This PR fixes array set-operation UDFs to follow SQL null-propagation semantics: if any input is NULL, the result is NULL. Changes:
Technical Notes: Uses Arrow’s 🤖 Was this summary useful? React with 👍 or 👎 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review completed. No suggestions at this time.
Comment augment review to trigger a new review at any time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @datafusion/functions-nested/src/set_ops.rs:
- Around line 69-71: The docstring for array_union incorrectly describes
intersection; update the user_doc description on the array_union function to
reflect union semantics by replacing "Returns an array of elements that are
present in both arrays" with wording like "Returns an array of elements that are
present in either array (all elements from both arrays) without duplicates,"
ensuring the parenthetical note remains and the description unambiguously states
union behavior.
In @docs/source/user-guide/sql/scalar_functions.md:
- Line 4214: The description for the scalar function array_union is inconsistent
(says "present in both arrays" but means union); update the
docstring/description for array_union to read something like: "Returns an array
containing all unique elements from both input arrays, without duplicates,"
ensuring it clearly states it returns the union of both arrays (all unique
elements) rather than the intersection. Make this change where the array_union
function description appears to replace the existing contradictory sentence.
🧹 Nitpick comments (2)
datafusion/sqllogictest/test_files/array.slt (1)
4841-4870: Good addition of explicit List-typed NULL coverage for set opsThe new tests using
arrow_cast(null, 'List(Int64)')forarray_union,array_intersect, andarray_exceptcorrectly assert NULL results and close a subtle gap where the NULL is already list-typed rather than inferred. Solid extension of coverage.If you want to go one step further, you could mirror these with
LargeList(Int64)/FixedSizeListcasts to ensure future refactors keep NULL propagation identical across all list encodings.datafusion/functions-nested/src/except.rs (1)
185-212: Consider renaming inner loop variables to avoid shadowing.The loop variable
iin lines 199 and 203 shadows the outerifrom the enumerate on line 192. While this works correctly, it reduces readability. Consider using distinct names liker_idxandl_idxfor the inner loops.♻️ Suggested improvement
- for i in r_start.as_usize()..r_end.as_usize() { - let right_row = r_values.row(i); + for r_idx in r_start.as_usize()..r_end.as_usize() { + let right_row = r_values.row(r_idx); dedup.insert(right_row); } - for i in l_start.as_usize()..l_end.as_usize() { - let left_row = l_values.row(i); + for l_idx in l_start.as_usize()..l_end.as_usize() { + let left_row = l_values.row(l_idx); if dedup.insert(left_row) { rows.push(left_row); } }
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
datafusion/functions-nested/src/except.rsdatafusion/functions-nested/src/set_ops.rsdatafusion/sqllogictest/test_files/array.sltdocs/source/user-guide/sql/scalar_functions.md
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-06T13:37:46.028Z
Learnt from: martin-augment
Repo: martin-augment/datafusion PR: 5
File: datafusion/core/tests/user_defined/user_defined_scalar_functions.rs:997-1005
Timestamp: 2025-11-06T13:37:46.028Z
Learning: In datafusion/core/tests/user_defined/user_defined_scalar_functions.rs, when implementing ScalarFunctionWrapper::replacement, always bounds-check placeholder_position against defaults.len() before indexing into the defaults array to prevent panics when SQL UDF bodies reference placeholders like $N where N exceeds the declared parameter count.
Applied to files:
datafusion/functions-nested/src/except.rs
🧬 Code graph analysis (1)
datafusion/functions-nested/src/except.rs (2)
datafusion/functions-nested/src/utils.rs (1)
check_datatypes(36-50)datafusion/common/src/scalar/mod.rs (1)
new_list(2757-2772)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Cursor Bugbot
- GitHub Check: claude-review
- GitHub Check: Analyze (rust)
🔇 Additional comments (14)
datafusion/sqllogictest/test_files/array.slt (3)
4771-4807: array_union now consistently propagates NULL inputsThese updated expectations (NULL whenever either argument is NULL, including
nullvs[]andnullvs non-empty arrays) align with the PR goal and keep semantics uniform across scalar and LargeList variants. Looks good.
6919-6966: array_intersect NULL vs empty-array combinations are now consistentThe changed expectations for
array_intersectwithnullon either side (including[]/LargeList([])vsnull) now all return NULL, while non-NULL arrays containing NULL elements still return concrete arrays. This is consistent with the new “any NULL input → NULL output” rule for set operations.
7502-7615: array_except NULL-handling across typed tables is aligned with new semanticsFor int, nested-list, utf8, bool, and
[]vsnullcases,array_exceptnow returns NULL whenever either input array is NULL (including column values and scalar arguments), while purely non-NULL inputs keep their previous outputs. This matches the intended behavior and the patterns used forarray_unionandarray_intersect.datafusion/functions-nested/src/except.rs (4)
18-34: LGTM - Import additions support the new null handling.The new imports for
new_null_array,NullBuffer, andItertoolsare appropriately added to support the enhanced null handling and offset iteration logic.
108-116: Verify return type semantics for(Null, dt)case.For
array_except(A, B), the result contains elements from A not in B. When A isNull, the expected semantic is that the result should be null (not the type of B).The current implementation returns
dt.clone()for(Null, dt), which means if the first argument isNulland the second isList<Int32>, the return type will beList<Int32>. This is consistent withset_ops.rs, but confirm this is the intended behavior for type inference purposes (the actual value will still be null as handled inarray_except_inner).
137-146: LGTM - Null input handling is consistent with return type inference.The implementation correctly creates null arrays of appropriate types and lengths for all null input combinations. This aligns with the PR objective of ensuring null inputs produce null outputs.
214-223: LGTM - Null buffer correctly propagated to final result.The computed
nullsbuffer fromNullBuffer::unionis correctly passed to theGenericListArray::newconstructor, ensuring null rows are properly represented in the output.datafusion/functions-nested/src/set_ops.rs (7)
21-23: LGTM - Import additions for null and empty array handling.
134-141: LGTM - Return type inference consistent with other set operations.
218-225: LGTM - Consistent null handling in return type inference.
364-376: LGTM - Clean null handling with early continue pattern.The refactored iteration correctly handles null list elements by preserving offsets without processing, which maintains correct array structure while skipping null rows.
415-419: LGTM - Proper handling of empty result arrays.The check for empty
new_arrays_refbefore callingcompute::concatprevents potential issues and correctly produces an empty array of the appropriate type.
420-426: LGTM - Null buffer correctly computed and propagated.Using
NullBuffer::unionensures that rows where either input is null are correctly marked as null in the output.
429-440: LGTM - Null input handling consistent with array_except.The null handling in
general_set_opfollows the same pattern asarray_except_inner, ensuring consistent behavior across all array set operations.
| #[user_doc( | ||
| doc_section(label = "Array Functions"), | ||
| description = "Returns an array of elements that are present in both arrays (all elements from both arrays) with out duplicates.", | ||
| description = "Returns an array of elements that are present in both arrays (all elements from both arrays) without duplicates.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Documentation describes intersection semantics instead of union.
The description "Returns an array of elements that are present in both arrays" describes intersection behavior, but this is array_union. The parenthetical "(all elements from both arrays)" is correct, but the initial phrase is misleading.
📝 Suggested fix
- description = "Returns an array of elements that are present in both arrays (all elements from both arrays) without duplicates.",
+ description = "Returns an array of all unique elements from both arrays without duplicates.",📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| #[user_doc( | |
| doc_section(label = "Array Functions"), | |
| description = "Returns an array of elements that are present in both arrays (all elements from both arrays) with out duplicates.", | |
| description = "Returns an array of elements that are present in both arrays (all elements from both arrays) without duplicates.", | |
| #[user_doc( | |
| doc_section(label = "Array Functions"), | |
| description = "Returns an array of all unique elements from both arrays without duplicates.", |
🤖 Prompt for AI Agents
In @datafusion/functions-nested/src/set_ops.rs around lines 69 - 71, The
docstring for array_union incorrectly describes intersection; update the
user_doc description on the array_union function to reflect union semantics by
replacing "Returns an array of elements that are present in both arrays" with
wording like "Returns an array of elements that are present in either array (all
elements from both arrays) without duplicates," ensuring the parenthetical note
remains and the description unambiguously states union behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value:good-to-have; category:documentation; feedback:The CodeRabbit AI reviewer is correct! The documentation for array_union() is a bit confusing and it could be improved to not sound like it is about interception.
| ### `array_union` | ||
|
|
||
| Returns an array of elements that are present in both arrays (all elements from both arrays) with out duplicates. | ||
| Returns an array of elements that are present in both arrays (all elements from both arrays) without duplicates. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clarify array_union semantics in the description.
The current description contains a logical inconsistency: "elements that are present in both arrays" describes intersection, but "(all elements from both arrays)" describes union. The function should return all unique elements from both arrays combined, not just the elements common to both.
Suggested fix: "Returns an array containing all unique elements from both input arrays, without duplicates."
🤖 Prompt for AI Agents
In @docs/source/user-guide/sql/scalar_functions.md at line 4214, The description
for the scalar function array_union is inconsistent (says "present in both
arrays" but means union); update the docstring/description for array_union to
read something like: "Returns an array containing all unique elements from both
input arrays, without duplicates," ensuring it clearly states it returns the
union of both arrays (all unique elements) rather than the intersection. Make
this change where the array_union function description appears to replace the
existing contradictory sentence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value:good-to-have; category:documentation; feedback:The CodeRabbit AI reviewer is correct! The documentation for array_union() is a bit confusing and it could be improved to not sound like it is about interception.
value:good-to-have; category:bug; feedback:The CodeRabbit AI reviewer is correct! There are three loops (one outer and two inner ones) that use |
19683: To review by AI