Conversation

@martin-augment
Owner

19683: To review by AI

@coderabbitai

coderabbitai bot commented Jan 8, 2026

Walkthrough

This pull request modifies null-handling behavior in DataFusion's array set operations (union, intersect, except). The changes update return type inference to handle null inputs more consistently, introduce null buffer union operations for tracking nullability across combined arrays, and refactor internal logic to replace per-element null scaffolding with early-continue approaches. Test expectations are updated to reflect changed behavior where certain null input scenarios now return null instead of empty lists. Documentation for array_union is corrected. Overall, approximately 112 lines are modified across implementation and test files to standardize null semantics in nested array operations.


@claude

claude bot commented Jan 8, 2026

Code Review: PR #171 - Ensure null inputs to array setop functions return null output

Summary

This PR addresses issue apache#19683 by ensuring that array set operation functions (array_union, array_intersect, array_except) return null when any input is null, which is the correct SQL behavior for handling null values.
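The rule described above can be illustrated with a small standard-library model. This is only a sketch, not DataFusion's implementation: `ListRow` and `array_except_model` are hypothetical names, and a NULL list is modeled as `None`.

```rust
use std::collections::HashSet;

/// Toy stand-in for one row of a list column: `None` models a NULL list.
type ListRow = Option<Vec<i64>>;

/// Elements of `left` that are absent from `right`, deduplicated,
/// or `None` (NULL) as soon as either input is NULL.
fn array_except_model(left: &ListRow, right: &ListRow) -> ListRow {
    let (left, right) = match (left, right) {
        (Some(l), Some(r)) => (l, r),
        // Any NULL input yields a NULL output, never an empty list.
        _ => return None,
    };
    // Seed the set with the right-hand values so they are filtered out.
    let mut seen: HashSet<i64> = right.iter().copied().collect();
    let mut out = Vec::new();
    for &v in left {
        // `insert` returns true only the first time a value is seen.
        if seen.insert(v) {
            out.push(v);
        }
    }
    Some(out)
}
```

In this model an empty list and a NULL list stay distinct: `Some(vec![])` is a valid result, while `None` only appears when an input was `None`.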

Code Quality & Best Practices ✅

Strengths:

  • Clean refactoring that consolidates null handling logic across all three set operation functions
  • Good use of pattern matching to handle different data type combinations
  • Proper use of NullBuffer::union() to correctly propagate null values
  • Code is more maintainable after removing the empty_array() helper and 60+ lines of repetitive null-handling code
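To make the null-buffer point concrete, here is a rough standard-library model of what combining two optional validity masks looks like. It does not call Arrow; `Validity` and `union_nulls` are hypothetical names, and the sketch assumes the usual Arrow convention that `true` means "valid" and that a missing buffer means "no nulls".

```rust
/// Validity mask for a column: `true` = valid row, `false` = NULL row.
/// `None` models "no null buffer", i.e. every row is valid.
type Validity = Option<Vec<bool>>;

/// A row is valid in the output only if it is valid in both inputs,
/// so the set of NULL rows is the union of the inputs' NULL rows.
fn union_nulls(lhs: &Validity, rhs: &Validity) -> Validity {
    match (lhs, rhs) {
        (Some(l), Some(r)) => {
            assert_eq!(l.len(), r.len(), "inputs must have the same row count");
            Some(l.iter().zip(r).map(|(a, b)| *a && *b).collect())
        }
        // Only one side carries nulls: reuse its mask unchanged.
        (Some(v), None) | (None, Some(v)) => Some(v.clone()),
        // Neither side has nulls, so the output needs no mask either.
        (None, None) => None,
    }
}
```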

Minor Improvements:

  1. set_ops.rs:367-376 - The null handling in generic_set_lists is clean, but consider adding a comment explaining why we skip processing when either input is null:

    // Skip processing if either input is null - output will be null
    _ => {
        offsets.push(last_offset);
        continue;
    }

  2. except.rs:194 - Consider extracting the null check to improve readability:

    if nulls.as_ref().is_some_and(|nulls| nulls.is_null(i)) {

    Could be:

    let is_null = nulls.as_ref().map_or(false, |n| n.is_null(i));
    if is_null {

Potential Bugs & Issues ✅

No critical bugs found. The implementation is sound:

  • Null handling is consistent across all three functions
  • The use of NullBuffer::union() correctly combines null bitmaps from both inputs
  • Edge cases (Null+Null, Null+List, List+Null) are properly handled
  • Return types correctly match input types when dealing with nulls

One minor concern:

  • except.rs:187-192 - The use of tuple_windows() with enumerate() is correct, but ensure that the index i properly corresponds to the row index for null checking. This appears correct since we're enumerating the zip of both offset iterators.
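
The index concern can be checked with a small model. This sketch swaps itertools' `tuple_windows()` for the standard library's `slice::windows(2)`, which yields the same consecutive pairs; `row_ranges` is a hypothetical helper, and the offsets are assumed to have one more entry than there are rows.

```rust
/// Pairs each row index with the value ranges of both list inputs.
/// `l_offsets` and `r_offsets` are list offsets with `rows + 1` entries each.
fn row_ranges(
    l_offsets: &[usize],
    r_offsets: &[usize],
) -> Vec<(usize, (usize, usize), (usize, usize))> {
    l_offsets
        .windows(2)
        .zip(r_offsets.windows(2))
        // `enumerate` counts zipped pairs, so `i` is exactly the row index.
        .enumerate()
        .map(|(i, (l, r))| (i, (l[0], l[1]), (r[0], r[1])))
        .collect()
}
```

Because each window is one row's `(start, end)` pair, the `i` produced by `enumerate` lines up with the row position used for the null check.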

Performance Considerations ✅

Positive changes:

  1. Reduced allocations - Removed unnecessary new_null_buf vector building in generic_set_lists (set_ops.rs:~400)
  2. Early exit optimization - When inputs are null, we now immediately return null arrays without any processing
  3. Simplified code path - Removed the empty_array() helper and complex conditional logic, reducing branching

No performance regressions identified:

  • The NullBuffer::union() operation is efficient and commonly used in Arrow
  • Iterator usage with tuple_windows() is a zero-cost abstraction
  • The null check inside the loop (except.rs:194) adds minimal overhead and only executes when nulls are present

Test Coverage ✅

Excellent test coverage:

  • Added comprehensive tests for all three functions with explicit null type casting
  • Tests cover all combinations: null + array, array + null, and null + null
  • Tests verify behavior with both List and LargeList types
  • 30+ new test cases added across the three functions
  • Existing table-based tests also validate null handling with real data

Test cases added:

  • array_union(arrow_cast(null, 'List(Int64)'), [1, 2]) → NULL
  • array_intersect([1, 2], arrow_cast(null, 'List(Int64)')) → NULL
  • array_except(arrow_cast(null, 'List(Int64)'), [1, 2]) → NULL
  • And their symmetric counterparts

Security Concerns ✅

No security issues identified:

  • No unsafe code introduced
  • Proper error handling maintained
  • No potential for panics (all array accesses are safe)
  • No input validation bypassed

Additional Observations

Documentation:

  • Fixed typo in set_ops.rs:71: "with out" → "without" ✅

Code organization:

  • The consolidation of null handling logic from 60+ lines to ~5 lines per function is excellent
  • Pattern matching is more idiomatic Rust than the previous if-else chains

Recommendation

LGTM - Approve ✅

This is a well-implemented fix that:

  1. Correctly implements SQL null semantics
  2. Simplifies and improves code maintainability
  3. Has comprehensive test coverage
  4. Improves performance by eliminating unnecessary work
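  5. Contains no breaking API changes (results for NULL inputs do change from empty lists to NULL, as the updated test expectations show)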

The changes are focused, well-tested, and improve both correctness and code quality.

@augmentcode

augmentcode bot commented Jan 8, 2026

🤖 Augment PR Summary

Summary: This PR fixes array set-operation UDFs to follow SQL null-propagation semantics: if any input is NULL, the result is NULL.

Changes:

  • Update array_except type inference and execution to return a NULL list output when either argument is NULL (including the NULL+NULL case).
  • Update array_union / array_intersect set-op execution to return NULL arrays for NULL inputs, removing prior “treat NULL as empty” behavior.
  • Propagate per-row list nulls by combining input null buffers (so a NULL element in either list argument yields a NULL element in the output).
  • Handle all-null inputs in set-ops by producing an empty values array when concatenation would otherwise have no inputs.
  • Refresh sqllogictest expectations to assert NULL outputs for NULL inputs, and add explicit typed-NULL list cases via arrow_cast(null, 'List(...)').
  • Fix minor documentation wording (“with out” → “without”).

Technical Notes: Uses Arrow’s new_null_array and NullBuffer::union to ensure correct nullability at both the array and nested-list element levels.


@augmentcode augmentcode bot left a comment

Review completed. No suggestions at this time.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @datafusion/functions-nested/src/set_ops.rs:
- Around line 69-71: The docstring for array_union incorrectly describes
intersection; update the user_doc description on the array_union function to
reflect union semantics by replacing "Returns an array of elements that are
present in both arrays" with wording like "Returns an array of elements that are
present in either array (all elements from both arrays) without duplicates,"
ensuring the parenthetical note remains and the description unambiguously states
union behavior.

In @docs/source/user-guide/sql/scalar_functions.md:
- Line 4214: The description for the scalar function array_union is inconsistent
(says "present in both arrays" but means union); update the
docstring/description for array_union to read something like: "Returns an array
containing all unique elements from both input arrays, without duplicates,"
ensuring it clearly states it returns the union of both arrays (all unique
elements) rather than the intersection. Make this change where the array_union
function description appears to replace the existing contradictory sentence.
🧹 Nitpick comments (2)
datafusion/sqllogictest/test_files/array.slt (1)

4841-4870: Good addition of explicit List-typed NULL coverage for set ops

The new tests using arrow_cast(null, 'List(Int64)') for array_union, array_intersect, and array_except correctly assert NULL results and close a subtle gap where the NULL is already list-typed rather than inferred. Solid extension of coverage.

If you want to go one step further, you could mirror these with LargeList(Int64) / FixedSizeList casts to ensure future refactors keep NULL propagation identical across all list encodings.

datafusion/functions-nested/src/except.rs (1)

185-212: Consider renaming inner loop variables to avoid shadowing.

The loop variable i in lines 199 and 203 shadows the outer i from the enumerate on line 192. While this works correctly, it reduces readability. Consider using distinct names like r_idx and l_idx for the inner loops.

♻️ Suggested improvement
-        for i in r_start.as_usize()..r_end.as_usize() {
-            let right_row = r_values.row(i);
+        for r_idx in r_start.as_usize()..r_end.as_usize() {
+            let right_row = r_values.row(r_idx);
             dedup.insert(right_row);
         }
-        for i in l_start.as_usize()..l_end.as_usize() {
-            let left_row = l_values.row(i);
+        for l_idx in l_start.as_usize()..l_end.as_usize() {
+            let left_row = l_values.row(l_idx);
             if dedup.insert(left_row) {
                 rows.push(left_row);
             }
         }
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 02e7cbb and 455043d.

📒 Files selected for processing (4)
  • datafusion/functions-nested/src/except.rs
  • datafusion/functions-nested/src/set_ops.rs
  • datafusion/sqllogictest/test_files/array.slt
  • docs/source/user-guide/sql/scalar_functions.md
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-06T13:37:46.028Z
Learnt from: martin-augment
Repo: martin-augment/datafusion PR: 5
File: datafusion/core/tests/user_defined/user_defined_scalar_functions.rs:997-1005
Timestamp: 2025-11-06T13:37:46.028Z
Learning: In datafusion/core/tests/user_defined/user_defined_scalar_functions.rs, when implementing ScalarFunctionWrapper::replacement, always bounds-check placeholder_position against defaults.len() before indexing into the defaults array to prevent panics when SQL UDF bodies reference placeholders like $N where N exceeds the declared parameter count.

Applied to files:

  • datafusion/functions-nested/src/except.rs
🧬 Code graph analysis (1)
datafusion/functions-nested/src/except.rs (2)
datafusion/functions-nested/src/utils.rs (1)
  • check_datatypes (36-50)
datafusion/common/src/scalar/mod.rs (1)
  • new_list (2757-2772)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: claude-review
  • GitHub Check: Analyze (rust)
🔇 Additional comments (14)
datafusion/sqllogictest/test_files/array.slt (3)

4771-4807: array_union now consistently propagates NULL inputs

These updated expectations (NULL whenever either argument is NULL, including null vs [] and null vs non-empty arrays) align with the PR goal and keep semantics uniform across scalar and LargeList variants. Looks good.


6919-6966: array_intersect NULL vs empty-array combinations are now consistent

The changed expectations for array_intersect with null on either side (including []/LargeList([]) vs null) now all return NULL, while non-NULL arrays containing NULL elements still return concrete arrays. This is consistent with the new “any NULL input → NULL output” rule for set operations.


7502-7615: array_except NULL-handling across typed tables is aligned with new semantics

For int, nested-list, utf8, bool, and [] vs null cases, array_except now returns NULL whenever either input array is NULL (including column values and scalar arguments), while purely non-NULL inputs keep their previous outputs. This matches the intended behavior and the patterns used for array_union and array_intersect.

datafusion/functions-nested/src/except.rs (4)

18-34: LGTM - Import additions support the new null handling.

The new imports for new_null_array, NullBuffer, and Itertools are appropriately added to support the enhanced null handling and offset iteration logic.


108-116: Verify return type semantics for (Null, dt) case.

For array_except(A, B), the result contains the elements of A that are not in B. When A is Null, the expected semantics are that the result is null (not a value of B's type).

The current implementation returns dt.clone() for (Null, dt), which means if the first argument is Null and the second is List<Int32>, the return type will be List<Int32>. This is consistent with set_ops.rs, but confirm this is the intended behavior for type inference purposes (the actual value will still be null as handled in array_except_inner).
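
The type-inference rule under discussion can be sketched with a toy type enum. Everything here is hypothetical and simplified: `DataType` is a local stand-in, not Arrow's type, only the `(Null, dt)` arm is taken from the comment above, and the `(dt, Null)` and `(Null, Null)` arms are assumptions made so the sketch is complete.

```rust
/// Local toy model of a data type; not Arrow's `DataType`.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Null,
    Int32,
    List(Box<DataType>),
}

/// Picks the declared return type of a two-argument set operation.
fn set_op_return_type(left: &DataType, right: &DataType) -> DataType {
    match (left, right) {
        // Assumption: with no typed side available, stay untyped in this model.
        (DataType::Null, DataType::Null) => DataType::Null,
        // Described in the review: a Null side borrows the other side's type.
        // The symmetric `(dt, Null)` arm is an assumption.
        (DataType::Null, dt) | (dt, DataType::Null) => dt.clone(),
        // Both sides typed: this model simply keeps the left type.
        (dt, _) => dt.clone(),
    }
}
```

The declared type only says what kind of column comes back; whether a given row is NULL is decided separately at execution time.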


137-146: LGTM - Null input handling is consistent with return type inference.

The implementation correctly creates null arrays of appropriate types and lengths for all null input combinations. This aligns with the PR objective of ensuring null inputs produce null outputs.


214-223: LGTM - Null buffer correctly propagated to final result.

The computed nulls buffer from NullBuffer::union is correctly passed to the GenericListArray::new constructor, ensuring null rows are properly represented in the output.

datafusion/functions-nested/src/set_ops.rs (7)

21-23: LGTM - Import additions for null and empty array handling.


134-141: LGTM - Return type inference consistent with other set operations.


218-225: LGTM - Consistent null handling in return type inference.


364-376: LGTM - Clean null handling with early continue pattern.

The refactored iteration correctly handles null list elements by preserving offsets without processing, which maintains correct array structure while skipping null rows.
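
A minimal model of the early-continue pattern, using only the standard library: `build_offsets` is a hypothetical helper in which `None` marks a NULL row, and the point is that a NULL row repeats the previous offset so it occupies zero values.

```rust
/// Builds output offsets for per-row results, where `None` marks a NULL row.
/// Returns `(offsets, flattened_values)`.
fn build_offsets(rows: &[Option<Vec<i64>>]) -> (Vec<usize>, Vec<i64>) {
    let mut offsets = vec![0usize];
    let mut values = Vec::new();
    for row in rows {
        let last_offset = *offsets.last().expect("offsets starts non-empty");
        let Some(items) = row else {
            // NULL row: repeat the previous offset so the row has zero length.
            offsets.push(last_offset);
            continue;
        };
        values.extend_from_slice(items);
        offsets.push(last_offset + items.len());
    }
    (offsets, values)
}
```

The offsets alone cannot distinguish a NULL row from an empty one, which is why the null buffer still has to be attached to the final array.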


415-419: LGTM - Proper handling of empty result arrays.

The check for empty new_arrays_ref before calling compute::concat prevents potential issues and correctly produces an empty array of the appropriate type.


420-426: LGTM - Null buffer correctly computed and propagated.

Using NullBuffer::union ensures that rows where either input is null are correctly marked as null in the output.


429-440: LGTM - Null input handling consistent with array_except.

The null handling in general_set_op follows the same pattern as array_except_inner, ensuring consistent behavior across all array set operations.

Comment on lines 69 to +71
 #[user_doc(
     doc_section(label = "Array Functions"),
-    description = "Returns an array of elements that are present in both arrays (all elements from both arrays) with out duplicates.",
+    description = "Returns an array of elements that are present in both arrays (all elements from both arrays) without duplicates.",
⚠️ Potential issue | 🟡 Minor

Documentation describes intersection semantics instead of union.

The description "Returns an array of elements that are present in both arrays" describes intersection behavior, but this is array_union. The parenthetical "(all elements from both arrays)" is correct, but the initial phrase is misleading.

📝 Suggested fix
-    description = "Returns an array of elements that are present in both arrays (all elements from both arrays) without duplicates.",
+    description = "Returns an array of all unique elements from both arrays without duplicates.",
📝 Committable suggestion

Suggested change
 #[user_doc(
     doc_section(label = "Array Functions"),
-    description = "Returns an array of elements that are present in both arrays (all elements from both arrays) without duplicates.",
+    description = "Returns an array of all unique elements from both arrays without duplicates.",
Owner Author

value:good-to-have; category:documentation; feedback:The CodeRabbit AI reviewer is correct! The documentation for array_union() is somewhat confusing and could be improved so that it does not read like a description of intersection.

 ### `array_union`

-Returns an array of elements that are present in both arrays (all elements from both arrays) with out duplicates.
+Returns an array of elements that are present in both arrays (all elements from both arrays) without duplicates.
⚠️ Potential issue | 🟡 Minor

Clarify array_union semantics in the description.

The current description contains a logical inconsistency: "elements that are present in both arrays" describes intersection, but "(all elements from both arrays)" describes union. The function should return all unique elements from both arrays combined, not just the elements common to both.

Suggested fix: "Returns an array containing all unique elements from both input arrays, without duplicates."
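
The difference between the two readings is easy to see on plain vectors. The sketch below is a standard-library illustration only; `union_model` and `intersect_model` are hypothetical names and do not reflect how DataFusion implements these functions.

```rust
use std::collections::HashSet;

/// All distinct elements of `a` followed by those of `b`, in first-seen order.
fn union_model(a: &[i64], b: &[i64]) -> Vec<i64> {
    let mut seen = HashSet::new();
    a.iter()
        .chain(b)
        .copied()
        // `insert` is true only for values not seen before.
        .filter(|v| seen.insert(*v))
        .collect()
}

/// Distinct elements of `a` that also occur in `b`.
fn intersect_model(a: &[i64], b: &[i64]) -> Vec<i64> {
    let in_b: HashSet<i64> = b.iter().copied().collect();
    let mut seen = HashSet::new();
    a.iter()
        .copied()
        .filter(|v| in_b.contains(v) && seen.insert(*v))
        .collect()
}
```

For inputs `[1, 2, 3, 4]` and `[5, 6, 3, 4]`, the union model keeps every distinct value from either side, while the intersection model keeps only `3` and `4`, which is what "present in both arrays" literally describes.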

Owner Author

value:good-to-have; category:documentation; feedback:The CodeRabbit AI reviewer is correct! The documentation for array_union() is somewhat confusing and could be improved so that it does not read like a description of intersection.

@martin-augment
Owner Author

185-212: Consider renaming inner loop variables to avoid shadowing.

The loop variable i in lines 199 and 203 shadows the outer i from the enumerate on line 192. While this works correctly, it reduces readability. Consider using distinct names like r_idx and l_idx for the inner loops.

value:good-to-have; category:bug; feedback:The CodeRabbit AI reviewer is correct! Three loops (one outer and two inner) use a variable named i. Giving the inner loop variables distinct names keeps the code from becoming harder to read and follow.
