19683: Ensure null inputs to array setop functions return null output #171

martin-augment · 2026-01-08T06:09:48Z

19683: To review by AI

coderabbitai · 2026-01-08T06:10:21Z

Walkthrough

This pull request modifies null-handling behavior in DataFusion's array set operations (union, intersect, except). The changes update return type inference to handle null inputs more consistently, introduce null buffer union operations for tracking nullability across combined arrays, and refactor internal logic to replace per-element null scaffolding with early-continue approaches. Test expectations are updated to reflect changed behavior where certain null input scenarios now return null instead of empty lists. Documentation for array_union is corrected. Overall, approximately 112 lines are modified across implementation and test files to standardize null semantics in nested array operations.


claude · 2026-01-08T06:11:32Z

Code Review: PR #171 - Ensure null inputs to array setop functions return null output

Summary

This PR addresses issue apache#19683 by ensuring that array set operation functions (array_union, array_intersect, array_except) return null when any input is null, which is the correct SQL behavior for handling null values.
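
To make the rule concrete, here is a minimal plain-Rust sketch of the semantics described above. It is not DataFusion code: `NullableList` and the `model_*` functions are hypothetical names, `None` stands in for SQL NULL, elements are plain `i64` values (NULL elements inside a list are not modelled), and the first-seen output order is an assumption of the model.

```rust
use std::collections::HashSet;

/// Hypothetical model of a nullable list argument: `None` stands for SQL NULL.
type NullableList = Option<Vec<i64>>;

/// Model of `array_union`: NULL if either input is NULL, otherwise the
/// distinct elements of both inputs in first-seen order.
fn model_array_union(left: &NullableList, right: &NullableList) -> NullableList {
    // `?` returns `None` (NULL) as soon as either argument is NULL.
    let (l, r) = (left.as_ref()?, right.as_ref()?);
    let mut seen = HashSet::new();
    Some(l.iter().chain(r.iter()).copied().filter(|v| seen.insert(*v)).collect())
}

/// Model of `array_intersect`: NULL if either input is NULL, otherwise the
/// distinct elements of `left` that also occur in `right`.
fn model_array_intersect(left: &NullableList, right: &NullableList) -> NullableList {
    let (l, r) = (left.as_ref()?, right.as_ref()?);
    let right_set: HashSet<i64> = r.iter().copied().collect();
    let mut seen = HashSet::new();
    Some(
        l.iter()
            .copied()
            .filter(|v| right_set.contains(v) && seen.insert(*v))
            .collect(),
    )
}

/// Model of `array_except`: NULL if either input is NULL, otherwise the
/// distinct elements of `left` that do not occur in `right`.
fn model_array_except(left: &NullableList, right: &NullableList) -> NullableList {
    let (l, r) = (left.as_ref()?, right.as_ref()?);
    let right_set: HashSet<i64> = r.iter().copied().collect();
    let mut seen = HashSet::new();
    Some(
        l.iter()
            .copied()
            .filter(|v| !right_set.contains(v) && seen.insert(*v))
            .collect(),
    )
}
```

In this model an empty list and NULL are distinct inputs: `model_array_except(&Some(vec![]), &None)` is `None`, whereas the pre-PR behavior described in this thread treated NULL like an empty list.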

Code Quality & Best Practices ✅

Strengths:

Clean refactoring that consolidates null handling logic across all three set operation functions
Good use of pattern matching to handle different data type combinations
Proper use of NullBuffer::union() to correctly propagate null values
Code is more maintainable after removing the empty_array() helper and 60+ lines of repetitive null-handling code

Minor Improvements:

set_ops.rs:367-376 - The null handling in generic_set_lists is clean, but consider adding a comment explaining why we skip processing when either input is null:
```
// Skip processing if either input is null - output will be null
_ => {
    offsets.push(last_offset);
    continue;
}
```

except.rs:194 - Consider extracting the null check to improve readability:

if nulls.as_ref().is_some_and(|nulls| nulls.is_null(i)) {

Could be:

let is_null = nulls.as_ref().map_or(false, |n| n.is_null(i));
if is_null {

Potential Bugs & Issues ✅

No critical bugs found. The implementation is sound:

Null handling is consistent across all three functions
The use of NullBuffer::union() correctly combines null bitmaps from both inputs
Edge cases (Null+Null, Null+List, List+Null) are properly handled
Return types correctly match input types when dealing with nulls
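
For readers unfamiliar with how combining null bitmaps works, the following is a rough std-only sketch of the idea, not Arrow's implementation. `union_validity` is a hypothetical helper; a `bool` slice stands in for a validity bitmap (`true` means the row is valid), and `None` means no bitmap is present, so every row is valid.

```rust
/// Hypothetical stand-in for combining two validity bitmaps.
/// A row of the result is valid only if it is valid in both inputs,
/// so a NULL row on either side yields a NULL row in the output.
fn union_validity(lhs: Option<&[bool]>, rhs: Option<&[bool]>) -> Option<Vec<bool>> {
    match (lhs, rhs) {
        (Some(l), Some(r)) => {
            assert_eq!(l.len(), r.len(), "inputs must have the same row count");
            // Logical AND of validity == logical OR of null-ness.
            Some(l.iter().zip(r).map(|(a, b)| *a && *b).collect())
        }
        // Only one side tracks nulls: its bitmap is the result.
        (Some(only), None) | (None, Some(only)) => Some(only.to_vec()),
        // Neither side has nulls: the output needs no bitmap either.
        (None, None) => None,
    }
}
```

For example, combining `[true, false, true]` with `[true, true, false]` gives `[true, false, false]`: only the first row is non-NULL in both inputs.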

One minor concern:

except.rs:187-192 - The use of tuple_windows() with enumerate() is correct, but ensure that the index i properly corresponds to the row index for null checking. This appears correct since we're enumerating the zip of both offset iterators.

Performance Considerations ✅

Positive changes:

Reduced allocations - Removed unnecessary new_null_buf vector building in generic_set_lists (set_ops.rs:~400)
Early exit optimization - When inputs are null, we now immediately return null arrays without any processing
Simplified code path - Removed the empty_array() helper and complex conditional logic, reducing branching

No performance regressions identified:

The NullBuffer::union() operation is efficient and commonly used in Arrow
Iterator usage with tuple_windows() is a zero-cost abstraction
The null check inside the loop (except.rs:194) adds minimal overhead and only executes when nulls are present

Test Coverage ✅

Excellent test coverage:

Added comprehensive tests for all three functions with explicit null type casting
Tests cover all combinations: null + array, array + null, and null + null
Tests verify behavior with both List and LargeList types
30+ new test cases added across the three functions
Existing table-based tests also validate null handling with real data

Test cases added:

array_union(arrow_cast(null, 'List(Int64)'), [1, 2]) → NULL
array_intersect([1, 2], arrow_cast(null, 'List(Int64)')) → NULL
array_except(arrow_cast(null, 'List(Int64)'), [1, 2]) → NULL
And their symmetric counterparts

Security Concerns ✅

No security issues identified:

No unsafe code introduced
Proper error handling maintained
No potential for panics (all array accesses are safe)
No input validation bypassed

Additional Observations

Documentation:

Fixed typo in set_ops.rs:71: "with out" → "without" ✅

Code organization:

The consolidation of null handling logic from 60+ lines to ~5 lines per function is excellent
Pattern matching is more idiomatic Rust than the previous if-else chains

Recommendation

LGTM - Approve ✅

This is a well-implemented fix that:

Correctly implements SQL null semantics
Simplifies and improves code maintainability
Has comprehensive test coverage
Improves performance by eliminating unnecessary work
Contains no breaking changes

The changes are focused, well-tested, and improve both correctness and code quality.

augmentcode · 2026-01-08T06:15:36Z

🤖 Augment PR Summary

Summary: This PR fixes array set-operation UDFs to follow SQL null-propagation semantics: if any input is NULL, the result is NULL.

Changes:

Update array_except type inference and execution to return a NULL list output when either argument is NULL (including the NULL+NULL case).
Update array_union / array_intersect set-op execution to return NULL arrays for NULL inputs, removing prior “treat NULL as empty” behavior.
Propagate per-row list nulls by combining input null buffers (so a NULL element in either list argument yields a NULL element in the output).
Handle all-null inputs in set-ops by producing an empty values array when concatenation would otherwise have no inputs.
Refresh sqllogictest expectations to assert NULL outputs for NULL inputs, and add explicit typed-NULL list cases via arrow_cast(null, 'List(...)').
Fix minor documentation wording (“with out” → “without”).

Technical Notes: Uses Arrow’s new_null_array and NullBuffer::union to ensure correct nullability at both the array and nested-list element levels.


augmentcode

Review completed. No suggestions at this time.


coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In @datafusion/functions-nested/src/set_ops.rs:
- Around line 69-71: The docstring for array_union incorrectly describes
intersection; update the user_doc description on the array_union function to
reflect union semantics by replacing "Returns an array of elements that are
present in both arrays" with wording like "Returns an array of elements that are
present in either array (all elements from both arrays) without duplicates,"
ensuring the parenthetical note remains and the description unambiguously states
union behavior.

In @docs/source/user-guide/sql/scalar_functions.md:
- Line 4214: The description for the scalar function array_union is inconsistent
(says "present in both arrays" but means union); update the
docstring/description for array_union to read something like: "Returns an array
containing all unique elements from both input arrays, without duplicates,"
ensuring it clearly states it returns the union of both arrays (all unique
elements) rather than the intersection. Make this change where the array_union
function description appears to replace the existing contradictory sentence.

🧹 Nitpick comments (2)

datafusion/sqllogictest/test_files/array.slt (1)

4841-4870: Good addition of explicit List-typed NULL coverage for set ops

The new tests using arrow_cast(null, 'List(Int64)') for array_union, array_intersect, and array_except correctly assert NULL results and close a subtle gap where the NULL is already list-typed rather than inferred. Solid extension of coverage.

If you want to go one step further, you could mirror these with LargeList(Int64) / FixedSizeList casts to ensure future refactors keep NULL propagation identical across all list encodings.

datafusion/functions-nested/src/except.rs (1)

185-212: Consider renaming inner loop variables to avoid shadowing.

The loop variable i in lines 199 and 203 shadows the outer i from the enumerate on line 192. While this works correctly, it reduces readability. Consider using distinct names like r_idx and l_idx for the inner loops.

♻️ Suggested improvement
-        for i in r_start.as_usize()..r_end.as_usize() {
-            let right_row = r_values.row(i);
+        for r_idx in r_start.as_usize()..r_end.as_usize() {
+            let right_row = r_values.row(r_idx);
             dedup.insert(right_row);
         }
-        for i in l_start.as_usize()..l_end.as_usize() {
-            let left_row = l_values.row(i);
+        for l_idx in l_start.as_usize()..l_end.as_usize() {
+            let left_row = l_values.row(l_idx);
             if dedup.insert(left_row) {
                 rows.push(left_row);
             }
         }
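
The renamed loops can be seen in context in the std-only sketch below. It is an illustration, not the code in except.rs: `except_rows` is a hypothetical function, lists are stored as a flat `values` slice plus `offsets` (row `row_idx` spans `offsets[row_idx]..offsets[row_idx + 1]`), and `validity` is a `bool` slice standing in for the combined null buffer. With distinct names it is easy to check that `row_idx` is only used for the null check, while `r_idx` and `l_idx` index into the flattened values.

```rust
use std::collections::HashSet;

/// Hypothetical per-row "except" over two flattened list columns.
/// Returns one entry per row: `None` for a NULL row, otherwise the
/// distinct left elements that do not appear in the right list.
fn except_rows(
    l_values: &[i64],
    l_offsets: &[usize],
    r_values: &[i64],
    r_offsets: &[usize],
    validity: Option<&[bool]>,
) -> Vec<Option<Vec<i64>>> {
    let mut out = Vec::new();
    let windows = l_offsets.windows(2).zip(r_offsets.windows(2));
    for (row_idx, (l_win, r_win)) in windows.enumerate() {
        // `row_idx` is the row index, so it lines up with the validity slice.
        if validity.is_some_and(|v| !v[row_idx]) {
            out.push(None);
            continue;
        }
        let mut dedup = HashSet::new();
        // Seed the set with the right-hand elements so they are excluded.
        for r_idx in r_win[0]..r_win[1] {
            dedup.insert(r_values[r_idx]);
        }
        let mut kept = Vec::new();
        // `insert` returns true only for values not seen before.
        for l_idx in l_win[0]..l_win[1] {
            if dedup.insert(l_values[l_idx]) {
                kept.push(l_values[l_idx]);
            }
        }
        out.push(Some(kept));
    }
    out
}
```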

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 02e7cbb and 455043d.

📒 Files selected for processing (4)

datafusion/functions-nested/src/except.rs
datafusion/functions-nested/src/set_ops.rs
datafusion/sqllogictest/test_files/array.slt
docs/source/user-guide/sql/scalar_functions.md

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-11-06T13:37:46.028Z

Learnt from: martin-augment
Repo: martin-augment/datafusion PR: 5
File: datafusion/core/tests/user_defined/user_defined_scalar_functions.rs:997-1005
Timestamp: 2025-11-06T13:37:46.028Z
Learning: In datafusion/core/tests/user_defined/user_defined_scalar_functions.rs, when implementing ScalarFunctionWrapper::replacement, always bounds-check placeholder_position against defaults.len() before indexing into the defaults array to prevent panics when SQL UDF bodies reference placeholders like $N where N exceeds the declared parameter count.

Applied to files:

datafusion/functions-nested/src/except.rs

🧬 Code graph analysis (1)

datafusion/functions-nested/src/except.rs (2)

datafusion/functions-nested/src/utils.rs (1)

check_datatypes (36-50)

datafusion/common/src/scalar/mod.rs (1)

new_list (2757-2772)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: Cursor Bugbot
GitHub Check: claude-review
GitHub Check: Analyze (rust)

🔇 Additional comments (14)

datafusion/sqllogictest/test_files/array.slt (3)

4771-4807: array_union now consistently propagates NULL inputs

These updated expectations (NULL whenever either argument is NULL, including null vs [] and null vs non-empty arrays) align with the PR goal and keep semantics uniform across scalar and LargeList variants. Looks good.

6919-6966: array_intersect NULL vs empty-array combinations are now consistent

The changed expectations for array_intersect with null on either side (including []/LargeList([]) vs null) now all return NULL, while non-NULL arrays containing NULL elements still return concrete arrays. This is consistent with the new “any NULL input → NULL output” rule for set operations.

7502-7615: array_except NULL-handling across typed tables is aligned with new semantics

For int, nested-list, utf8, bool, and [] vs null cases, array_except now returns NULL whenever either input array is NULL (including column values and scalar arguments), while purely non-NULL inputs keep their previous outputs. This matches the intended behavior and the patterns used for array_union and array_intersect.

datafusion/functions-nested/src/except.rs (4)

18-34: LGTM - Import additions support the new null handling.

The new imports for new_null_array, NullBuffer, and Itertools are appropriately added to support the enhanced null handling and offset iteration logic.

108-116: Verify return type semantics for (Null, dt) case.

For array_except(A, B), the result contains elements from A not in B. When A is Null, the expected semantic is that the result should be null (not the type of B).

The current implementation returns dt.clone() for (Null, dt), which means if the first argument is Null and the second is List<Int32>, the return type will be List<Int32>. This is consistent with set_ops.rs, but confirm this is the intended behavior for type inference purposes (the actual value will still be null as handled in array_except_inner).
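
The distinction between the inferred type and the runtime value can be sketched with a toy type enum. This is only a model: `ModelType` and `model_return_type` are hypothetical names, and the `(Null, Null)` arm returning the null type is an assumption rather than something confirmed by the diff.

```rust
/// Hypothetical, heavily simplified stand-in for a data type enum.
#[derive(Debug, Clone, PartialEq)]
enum ModelType {
    Null,
    Int64,
    List(Box<ModelType>),
}

/// Sketch of return type inference for a binary array set operation.
/// When one side is untyped NULL, the type of the other side is used,
/// so the output column has a concrete list type even though every
/// value in it is NULL.
fn model_return_type(left: &ModelType, right: &ModelType) -> ModelType {
    match (left, right) {
        // Assumption: two untyped NULLs stay untyped.
        (ModelType::Null, ModelType::Null) => ModelType::Null,
        (ModelType::Null, dt) | (dt, ModelType::Null) => dt.clone(),
        // Both sides typed: this model simply takes the left type.
        (dt, _) => dt.clone(),
    }
}
```

Under this model, `model_return_type(&ModelType::Null, &ModelType::List(Box::new(ModelType::Int64)))` yields the list type, which matches the behavior questioned above: the type comes from the second argument, the value is still NULL.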

137-146: LGTM - Null input handling is consistent with return type inference.

The implementation correctly creates null arrays of appropriate types and lengths for all null input combinations. This aligns with the PR objective of ensuring null inputs produce null outputs.

214-223: LGTM - Null buffer correctly propagated to final result.

The computed nulls buffer from NullBuffer::union is correctly passed to the GenericListArray::new constructor, ensuring null rows are properly represented in the output.

datafusion/functions-nested/src/set_ops.rs (7)

21-23: LGTM - Import additions for null and empty array handling.

134-141: LGTM - Return type inference consistent with other set operations.

218-225: LGTM - Consistent null handling in return type inference.

364-376: LGTM - Clean null handling with early continue pattern.

The refactored iteration correctly handles null list elements by preserving offsets without processing, which maintains correct array structure while skipping null rows.
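
As a rough illustration of the early-continue pattern, the std-only sketch below builds the offsets, flattened values, and validity for a row-wise union. `union_rows` is a hypothetical function and each column is modelled as a slice of `Option<Vec<i64>>`. A NULL row repeats the previous offset, so it occupies a zero-length slot and contributes no values.

```rust
use std::collections::HashSet;

/// Hypothetical row-wise union of two list columns.
/// Returns `(offsets, values, validity)` in a list-array-like layout.
fn union_rows(
    left: &[Option<Vec<i64>>],
    right: &[Option<Vec<i64>>],
) -> (Vec<usize>, Vec<i64>, Vec<bool>) {
    let mut offsets = vec![0usize];
    let mut values = Vec::new();
    let mut validity = Vec::with_capacity(left.len());
    for (l, r) in left.iter().zip(right) {
        let (l, r) = match (l, r) {
            (Some(l), Some(r)) => (l, r),
            _ => {
                // Either input is NULL: the output row is NULL, so repeat
                // the last offset and skip all element processing.
                offsets.push(values.len());
                validity.push(false);
                continue;
            }
        };
        let mut seen = HashSet::new();
        for v in l.iter().chain(r) {
            if seen.insert(*v) {
                values.push(*v);
            }
        }
        offsets.push(values.len());
        validity.push(true);
    }
    (offsets, values, validity)
}
```

With `left = [[1, 2], NULL, [3]]` and `right = [[2, 3], [9], NULL]`, only the first row is processed: the offsets come out as `[0, 3, 3, 3]`, the values as `[1, 2, 3]`, and the last two rows are marked NULL.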

415-419: LGTM - Proper handling of empty result arrays.

The check for empty new_arrays_ref before calling compute::concat prevents potential issues and correctly produces an empty array of the appropriate type.

420-426: LGTM - Null buffer correctly computed and propagated.

Using NullBuffer::union ensures that rows where either input is null are correctly marked as null in the output.

429-440: LGTM - Null input handling consistent with array_except.

The null handling in general_set_op follows the same pattern as array_except_inner, ensuring consistent behavior across all array set operations.

coderabbitai · 2026-01-08T06:16:14Z

datafusion/functions-nested/src/set_ops.rs

 #[user_doc(
    doc_section(label = "Array Functions"),
-    description = "Returns an array of elements that are present in both arrays (all elements from both arrays) with out duplicates.",
+    description = "Returns an array of elements that are present in both arrays (all elements from both arrays) without duplicates.",


⚠️ Potential issue | 🟡 Minor

Documentation describes intersection semantics instead of union.

The description "Returns an array of elements that are present in both arrays" describes intersection behavior, but this is array_union. The parenthetical "(all elements from both arrays)" is correct, but the initial phrase is misleading.

📝 Suggested fix

- description = "Returns an array of elements that are present in both arrays (all elements from both arrays) without duplicates.",
+ description = "Returns an array of all unique elements from both arrays without duplicates.",

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

#[user_doc(
    doc_section(label = "Array Functions"),
    description = "Returns an array of elements that are present in both arrays (all elements from both arrays) with out duplicates.",
    description = "Returns an array of elements that are present in both arrays (all elements from both arrays) without duplicates.",

#[user_doc(
    doc_section(label = "Array Functions"),
    description = "Returns an array of all unique elements from both arrays without duplicates.",

🤖 Prompt for AI Agents

In @datafusion/functions-nested/src/set_ops.rs around lines 69 - 71, The docstring for array_union incorrectly describes intersection; update the user_doc description on the array_union function to reflect union semantics by replacing "Returns an array of elements that are present in both arrays" with wording like "Returns an array of elements that are present in either array (all elements from both arrays) without duplicates," ensuring the parenthetical note remains and the description unambiguously states union behavior.

value:good-to-have; category:documentation; feedback:The CodeRabbit AI reviewer is correct! The documentation for array_union() is confusing and should be reworded so that it does not read like a description of intersection.

coderabbitai · 2026-01-08T06:16:14Z

docs/source/user-guide/sql/scalar_functions.md

 ### `array_union`

-Returns an array of elements that are present in both arrays (all elements from both arrays) with out duplicates.
+Returns an array of elements that are present in both arrays (all elements from both arrays) without duplicates.


⚠️ Potential issue | 🟡 Minor

Clarify array_union semantics in the description.

The current description contains a logical inconsistency: "elements that are present in both arrays" describes intersection, but "(all elements from both arrays)" describes union. The function should return all unique elements from both arrays combined, not just the elements common to both.

Suggested fix: "Returns an array containing all unique elements from both input arrays, without duplicates."

🤖 Prompt for AI Agents

In @docs/source/user-guide/sql/scalar_functions.md at line 4214, The description for the scalar function array_union is inconsistent (says "present in both arrays" but means union); update the docstring/description for array_union to read something like: "Returns an array containing all unique elements from both input arrays, without duplicates," ensuring it clearly states it returns the union of both arrays (all unique elements) rather than the intersection. Make this change where the array_union function description appears to replace the existing contradictory sentence.

value:good-to-have; category:documentation; feedback:The CodeRabbit AI reviewer is correct! The documentation for array_union() is confusing and should be reworded so that it does not read like a description of intersection.

martin-augment · 2026-01-08T06:43:11Z

185-212: Consider renaming inner loop variables to avoid shadowing.

The loop variable i in lines 199 and 203 shadows the outer i from the enumerate on line 192. While this works correctly, it reduces readability. Consider using distinct names like r_idx and l_idx for the inner loops.

value:good-to-have; category:bug; feedback:The CodeRabbit AI reviewer is correct! There are three loops (one outer and two inner) that all use a variable named i. Giving the inner loops distinct names keeps the code from becoming harder to read and follow.

Jefffrey added 2 commits January 7, 2026 22:43

Ensure null inputs to array setop functions return null output

1581857

doc update

455043d
