-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Support LargeList
in array_has
simplification to InList
#17732
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
let mut scalars = Vec::with_capacity(array.len()); | ||
|
||
for index in 0..array.len() { | ||
let nested_array = array.as_list::<i32>().value(index); | ||
let scalar_values = (0..nested_array.len()) | ||
.map(|i| ScalarValue::try_from_array(&nested_array, i)) | ||
.collect::<Result<Vec<_>>>()?; | ||
scalars.push(scalar_values); | ||
fn generic_collect<OffsetSize: OffsetSizeTrait>( | ||
array: &dyn Array, | ||
) -> Result<Vec<Vec<ScalarValue>>> { | ||
array | ||
.as_list::<OffsetSize>() | ||
.iter() | ||
.map(|nested_array| match nested_array { | ||
Some(nested_array) => (0..nested_array.len()) | ||
.map(|i| ScalarValue::try_from_array(&nested_array, i)) | ||
.collect::<Result<Vec<_>>>(), | ||
// TODO: what can we put for null? | ||
None => Ok(vec![]), | ||
}) | ||
.collect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it used to do array.as_list::<i32>().value(index)
, this never checked for nulls before.
So in the test case I added below:
// Funky (null slot has non-zero list offsets)
// Offsets + Values looks like this: [[1, 2], [3, 4], [5]]
// But with NullBuffer it's like this: [[1, 2], NULL, [5]]
let funky = ListArray::new(
Field::new_list_field(DataType::Int64, true).into(),
OffsetBuffer::new(vec![0, 2, 4, 5].into()),
Arc::new(Int64Array::from(vec![1, 2, 3, 4, 5, 6])),
Some(NullBuffer::from(vec![true, false, true])),
);
let converted = ScalarValue::convert_array_to_scalar_vec(&funky).unwrap();
assert_eq!(
converted,
vec![
vec![ScalarValue::Int64(Some(1)), ScalarValue::Int64(Some(2))],
vec![],
vec![ScalarValue::Int64(Some(5))],
]
);
For the output, it incorrect would have vec![3, 4]
instead of the empty Vec (it's accessing that element as if it were a valid list when in fact it was null in the parent list).
For now I made nulls return empty list to not change the signature of the method, though this seems undesirable 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we would have to research what the null semantics of array_has are in other implementations (postgres/duckdb/spark) and try and make this one consistent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can file a ticket / leave a reference to that ticket in the comments 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Raised #17749 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb By null semantics what do you mean exactly? Do you mean what this returns? select array_contains([NULL], 1)
If so, DuckDB returns false
:
D select list_contains([NULL], 1);
┌─────────────────────────────────────────┐
│ list_contains(main.list_value(NULL), 1) │
│ boolean │
├─────────────────────────────────────────┤
│ false │
└─────────────────────────────────────────┘
Postgres fails (probably I'm doing something wrong):
SELECT 1 = ANY(ARRAY[null]);
ERROR: 42883: operator does not exist: integer = text
LINE 1: SELECT 1 = ANY(ARRAY[null]);
^
HINT: No operator matches the given name and argument types. You might need to add explicit type casts.
LOCATION: op_error, parse_oper.c:656
Or the other way around: select array_contains(array[1], null)
DuckDB returns null
D select list_contains([1], null);
┌─────────────────────────────────────────┐
│ list_contains(main.list_value(1), NULL) │
│ boolean │
├─────────────────────────────────────────┤
│ NULL │
└─────────────────────────────────────────┘
Spark errors due to datatype mismatch.
Postgres returns null
:
SELECT null = ANY(ARRAY[1,2,3]); -- true
?column?
----------
¤
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made an attempt at fixing this here #17891, but I'm not sure about the null semantics. I should add a new test case for array_has, but haven't come up with one yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Jefffrey
05)--------ProjectionExec: expr=[] | ||
06)----------CoalesceBatchesExec: target_batch_size=8192 | ||
07)------------FilterExec: array_has([7f4b18de3cfeb9b4ac78c381ee2ad278, a, b, c], substr(md5(CAST(value@0 AS Utf8View)), 1, 32)) | ||
07)------------FilterExec: substr(md5(CAST(value@0 AS Utf8View)), 1, 32) IN ([Literal { value: Utf8View("7f4b18de3cfeb9b4ac78c381ee2ad278"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("a"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("b"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("c"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a pretty terrible plan display (not made worse by this PR, of course)
let mut scalars = Vec::with_capacity(array.len()); | ||
|
||
for index in 0..array.len() { | ||
let nested_array = array.as_list::<i32>().value(index); | ||
let scalar_values = (0..nested_array.len()) | ||
.map(|i| ScalarValue::try_from_array(&nested_array, i)) | ||
.collect::<Result<Vec<_>>>()?; | ||
scalars.push(scalar_values); | ||
fn generic_collect<OffsetSize: OffsetSizeTrait>( | ||
array: &dyn Array, | ||
) -> Result<Vec<Vec<ScalarValue>>> { | ||
array | ||
.as_list::<OffsetSize>() | ||
.iter() | ||
.map(|nested_array| match nested_array { | ||
Some(nested_array) => (0..nested_array.len()) | ||
.map(|i| ScalarValue::try_from_array(&nested_array, i)) | ||
.collect::<Result<Vec<_>>>(), | ||
// TODO: what can we put for null? | ||
None => Ok(vec![]), | ||
}) | ||
.collect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we would have to research what the null semantics of array_has are in other implementations (postgres/duckdb/spark) and try and make this one consistent
let mut scalars = Vec::with_capacity(array.len()); | ||
|
||
for index in 0..array.len() { | ||
let nested_array = array.as_list::<i32>().value(index); | ||
let scalar_values = (0..nested_array.len()) | ||
.map(|i| ScalarValue::try_from_array(&nested_array, i)) | ||
.collect::<Result<Vec<_>>>()?; | ||
scalars.push(scalar_values); | ||
fn generic_collect<OffsetSize: OffsetSizeTrait>( | ||
array: &dyn Array, | ||
) -> Result<Vec<Vec<ScalarValue>>> { | ||
array | ||
.as_list::<OffsetSize>() | ||
.iter() | ||
.map(|nested_array| match nested_array { | ||
Some(nested_array) => (0..nested_array.len()) | ||
.map(|i| ScalarValue::try_from_array(&nested_array, i)) | ||
.collect::<Result<Vec<_>>>(), | ||
// TODO: what can we put for null? | ||
None => Ok(vec![]), | ||
}) | ||
.collect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can file a ticket / leave a reference to that ticket in the comments 🤔
// if the haystack is a constant list, we can use an inlist expression which is more | ||
// efficient because the haystack is not varying per-row | ||
if let Expr::Literal(ScalarValue::List(array), _) = haystack { | ||
// TODO: support LargeList |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
.map(|v| Expr::Literal(v, None)) | ||
.collect(); | ||
|
||
return Ok(ExprSimplifyResult::Simplified(Expr::InList(InList { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could potentially make this more concise using
return Ok(ExprSimplifyResult::Simplified(in_list(std::mem::take(needle), list, false)));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, applied the simplification 👍
Thanks again @Jefffrey |
Which issue does this PR close?
LargeList
constant haystack inarray_has
toInList
expr #15389Rationale for this change
Already works for regular
List
(andFixedSizeList
technically, since it gets coerced toList
). Extending this support toLargeList
s.What changes are included in this PR?
Uplift
ScalarValue::convert_array_to_scalar_vec
to be generic across bothList
&LargeList
and enable this simplification forLargeList
s insidearray_has
.Also, fix
ScalarValue::convert_array_to_scalar_vec
to properly account for nulls in nested arrays.Are these changes tested?
Fixed existing tests to have this new behaviour.
Are there any user-facing changes?
No.