Closed
Describe the bug
In string operations such as .str.starts_with,
cudf-polars' streaming executor incorrectly fills missing values with False
instead of propagating the null when the input spans multiple partitions.
Steps/Code to reproduce bug
import polars as pl
ldf = pl.LazyFrame({"a": ["a", "b", None, "b"]})
q = ldf.select(pl.col("a").str.starts_with("a"))
q.collect(engine=pl.GPUEngine(executor="streaming", executor_options={"max_rows_per_partition": 2}))
This outputs:
shape: (4, 1)
┌───────┐
│ a     │
│ ---   │
│ bool  │
╞═══════╡
│ true  │
│ false │
│ false │
│ false │
└───────┘
Expected behavior
In [4]: q.collect()
Out[4]:
shape: (4, 1)
┌───────┐
│ a     │
│ ---   │
│ bool  │
╞═══════╡
│ true  │
│ false │
│ null │
│ false │
└───────┘
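The expected behavior amounts to an invariant: evaluating the expression partition by partition and concatenating the results should give the same answer as evaluating it over the whole column, nulls included. The following is a minimal plain-Python sketch of that invariant, not the cudf-polars implementation; the helper names `starts_with` and `starts_with_partitioned` are hypothetical and exist only for illustration.

```python
def starts_with(values, prefix):
    """Null-propagating prefix test: a None input yields a None output."""
    return [None if v is None else v.startswith(prefix) for v in values]


def starts_with_partitioned(values, prefix, max_rows_per_partition):
    """Evaluate each partition independently, then concatenate the results."""
    out = []
    for start in range(0, len(values), max_rows_per_partition):
        chunk = values[start:start + max_rows_per_partition]
        out.extend(starts_with(chunk, prefix))
    return out


data = ["a", "b", None, "b"]
whole = starts_with(data, "a")
chunked = starts_with_partitioned(data, "a", 2)

# Partitioning must not change the result: the null stays a null.
assert whole == chunked == [True, False, None, False]
```

With two rows per partition, the null sits in the second partition while the first partition has none, which is the same layout as the reproducer above.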
Additional context
Note that the dtype seems to be determined by the first partition.
If the first partition (or every partition) contains missing values, the nulls are propagated correctly:
q = pl.LazyFrame({"a": ["a", None, "a", None]}).select(pl.col("a").str.starts_with("a"))
q.collect(engine=pl.GPUEngine(executor="streaming", executor_options={"max_rows_per_partition": 2}))
shape: (4, 1)
┌──────┐
│ a    │
│ ---  │
│ bool │
╞══════╡
│ true │
│ null │
│ true │
│ null │
└──────┘