Skip to content

Conversation

Jefffrey
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

Already works for regular List (and FixedSizeList technically, since it gets coerced to List). Extending this support to LargeLists.

What changes are included in this PR?

Uplift ScalarValue::convert_array_to_scalar_vec to be generic across both List & LargeList and enable this simplification for LargeLists inside array_has.

Also, fix ScalarValue::convert_array_to_scalar_vec to properly account for nulls in nested arrays.

Are these changes tested?

Fixed existing tests to have this new behaviour.

Are there any user-facing changes?

No.

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) common Related to common crate labels Sep 22, 2025
Comment on lines 3308 to 3322
let mut scalars = Vec::with_capacity(array.len());

for index in 0..array.len() {
let nested_array = array.as_list::<i32>().value(index);
let scalar_values = (0..nested_array.len())
.map(|i| ScalarValue::try_from_array(&nested_array, i))
.collect::<Result<Vec<_>>>()?;
scalars.push(scalar_values);
fn generic_collect<OffsetSize: OffsetSizeTrait>(
array: &dyn Array,
) -> Result<Vec<Vec<ScalarValue>>> {
array
.as_list::<OffsetSize>()
.iter()
.map(|nested_array| match nested_array {
Some(nested_array) => (0..nested_array.len())
.map(|i| ScalarValue::try_from_array(&nested_array, i))
.collect::<Result<Vec<_>>>(),
// TODO: what can we put for null?
None => Ok(vec![]),
})
.collect()
Copy link
Contributor Author

@Jefffrey Jefffrey Sep 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it used to do array.as_list::<i32>().value(index), this never checked for nulls before.

So in the test case I added below:

// Funky (null slot has non-zero list offsets)
// Offsets + Values looks like this: [[1, 2], [3, 4], [5]]
// But with NullBuffer it's like this: [[1, 2], NULL, [5]]
let funky = ListArray::new(
    Field::new_list_field(DataType::Int64, true).into(),
    OffsetBuffer::new(vec![0, 2, 4, 5].into()),
    Arc::new(Int64Array::from(vec![1, 2, 3, 4, 5, 6])),
    Some(NullBuffer::from(vec![true, false, true])),
);
let converted = ScalarValue::convert_array_to_scalar_vec(&funky).unwrap();
assert_eq!(
    converted,
    vec![
        vec![ScalarValue::Int64(Some(1)), ScalarValue::Int64(Some(2))],
        vec![],
        vec![ScalarValue::Int64(Some(5))],
    ]
);

For the output, it incorrect would have vec![3, 4] instead of the empty Vec (it's accessing that element as if it were a valid list when in fact it was null in the parent list).

For now I made nulls return empty list to not change the signature of the method, though this seems undesirable 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we would have to research what the null semantics of array_has are in other implementations (postgres/duckdb/spark) and try and make this one consistent

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can file a ticket / leave a reference to that ticket in the comments 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raised #17749 👍

Copy link
Contributor

@vegarsti vegarsti Oct 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb By null semantics what do you mean exactly? Do you mean what this returns? select array_contains([NULL], 1)

If so, DuckDB returns false:

D select list_contains([NULL], 1);
┌─────────────────────────────────────────┐
│ list_contains(main.list_value(NULL), 1) │
│                 boolean                 │
├─────────────────────────────────────────┤
│ false                                   │
└─────────────────────────────────────────┘

Spark returns null:
CleanShot-2025-10-02-19-55-07@2x

Postgres fails (probably I'm doing something wrong):

SELECT 1 = ANY(ARRAY[null]);
ERROR:  42883: operator does not exist: integer = text
LINE 1: SELECT 1 = ANY(ARRAY[null]);
                 ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.
LOCATION:  op_error, parse_oper.c:656

Or the other way around: select array_contains(array[1], null)

DuckDB returns null

D select list_contains([1], null);
┌─────────────────────────────────────────┐
│ list_contains(main.list_value(1), NULL) │
│                 boolean                 │
├─────────────────────────────────────────┤
│ NULL                                    │
└─────────────────────────────────────────┘

Spark errors due to datatype mismatch.

Postgres returns null:

SELECT null = ANY(ARRAY[1,2,3]);  -- true
 ?column?
----------
 ¤

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made an attempt at fixing this here #17891, but I'm not sure about the null semantics. I should add a new test case for array_has, but haven't come up with one yet.

@Jefffrey Jefffrey marked this pull request as ready for review September 23, 2025 00:21
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Jefffrey

05)--------ProjectionExec: expr=[]
06)----------CoalesceBatchesExec: target_batch_size=8192
07)------------FilterExec: array_has([7f4b18de3cfeb9b4ac78c381ee2ad278, a, b, c], substr(md5(CAST(value@0 AS Utf8View)), 1, 32))
07)------------FilterExec: substr(md5(CAST(value@0 AS Utf8View)), 1, 32) IN ([Literal { value: Utf8View("7f4b18de3cfeb9b4ac78c381ee2ad278"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("a"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("b"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("c"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a pretty terrible plan display (not made worse by this PR, of course)

Comment on lines 3308 to 3322
let mut scalars = Vec::with_capacity(array.len());

for index in 0..array.len() {
let nested_array = array.as_list::<i32>().value(index);
let scalar_values = (0..nested_array.len())
.map(|i| ScalarValue::try_from_array(&nested_array, i))
.collect::<Result<Vec<_>>>()?;
scalars.push(scalar_values);
fn generic_collect<OffsetSize: OffsetSizeTrait>(
array: &dyn Array,
) -> Result<Vec<Vec<ScalarValue>>> {
array
.as_list::<OffsetSize>()
.iter()
.map(|nested_array| match nested_array {
Some(nested_array) => (0..nested_array.len())
.map(|i| ScalarValue::try_from_array(&nested_array, i))
.collect::<Result<Vec<_>>>(),
// TODO: what can we put for null?
None => Ok(vec![]),
})
.collect()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we would have to research what the null semantics of array_has are in other implementations (postgres/duckdb/spark) and try and make this one consistent

Comment on lines 3308 to 3322
let mut scalars = Vec::with_capacity(array.len());

for index in 0..array.len() {
let nested_array = array.as_list::<i32>().value(index);
let scalar_values = (0..nested_array.len())
.map(|i| ScalarValue::try_from_array(&nested_array, i))
.collect::<Result<Vec<_>>>()?;
scalars.push(scalar_values);
fn generic_collect<OffsetSize: OffsetSizeTrait>(
array: &dyn Array,
) -> Result<Vec<Vec<ScalarValue>>> {
array
.as_list::<OffsetSize>()
.iter()
.map(|nested_array| match nested_array {
Some(nested_array) => (0..nested_array.len())
.map(|i| ScalarValue::try_from_array(&nested_array, i))
.collect::<Result<Vec<_>>>(),
// TODO: what can we put for null?
None => Ok(vec![]),
})
.collect()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can file a ticket / leave a reference to that ticket in the comments 🤔

// if the haystack is a constant list, we can use an inlist expression which is more
// efficient because the haystack is not varying per-row
if let Expr::Literal(ScalarValue::List(array), _) = haystack {
// TODO: support LargeList
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

.map(|v| Expr::Literal(v, None))
.collect();

return Ok(ExprSimplifyResult::Simplified(Expr::InList(InList {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could potentially make this more concise using

                    return Ok(ExprSimplifyResult::Simplified(in_list(std::mem::take(needle), list, false)));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, applied the simplification 👍

@alamb alamb added this pull request to the merge queue Sep 24, 2025
@alamb
Copy link
Contributor

alamb commented Sep 24, 2025

Thanks again @Jefffrey

Merged via the queue into apache:main with commit 6ec14e9 Sep 24, 2025
28 checks passed
@Jefffrey Jefffrey deleted the array_has_inlist_largelist branch September 24, 2025 10:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Perf: simplify LargeList constant haystack in array_has to InList expr

3 participants