[BUG]: NA values incorrectly filled with False in String ops with streaming executor and multiple partitions #19148

Closed
@TomAugspurger

Describe the bug

String ops such as .str.starts_with incorrectly fill missing values with False instead of propagating NA when run with cudf-polars' streaming executor across multiple partitions.

Steps/Code to reproduce bug

import polars as pl

ldf = pl.LazyFrame({"a": ["a", "b", None, "b"]})
q = ldf.select(pl.col("a").str.starts_with("a"))

q.collect(engine=pl.GPUEngine(executor="streaming", executor_options={"max_rows_per_partition": 2}))

outputs

shape: (4, 1)
┌───────┐
│ a     │
│ ---   │
│ bool  │
╞═══════╡
│ true  │
│ false │
│ false │
│ false │
└───────┘

Expected behavior

In [4]: q.collect()
Out[4]: 
shape: (4, 1)
┌───────┐
│ a     │
│ ---   │
│ bool  │
╞═══════╡
│ true  │
│ false │
│ null  │
│ false │
└───────┘
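
For reference, the expected null-propagating semantics can be written in plain Python. This is only a sketch of what the result should look like, not how polars or cudf-polars implement it; the helper names `starts_with` and `partitioned_starts_with` are made up for illustration. The point is that splitting the input into partitions must not change the answer.

```python
from typing import List, Optional


def starts_with(values: List[Optional[str]], prefix: str) -> List[Optional[bool]]:
    """Reference semantics: a missing input yields a missing output."""
    return [None if v is None else v.startswith(prefix) for v in values]


def partitioned_starts_with(
    values: List[Optional[str]], prefix: str, max_rows_per_partition: int
) -> List[Optional[bool]]:
    """Evaluate each partition independently, then concatenate the results."""
    out: List[Optional[bool]] = []
    for start in range(0, len(values), max_rows_per_partition):
        out.extend(starts_with(values[start:start + max_rows_per_partition], prefix))
    return out


data = ["a", "b", None, "b"]
whole = starts_with(data, "a")
split = partitioned_starts_with(data, "a", max_rows_per_partition=2)
print(whole)  # [True, False, None, False]
print(split)  # [True, False, None, False]
```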

Additional context

Note that the dtype seems to be determined by the first partition.
If the first partition (or every partition) contains missing values, the result is correct:

q = pl.LazyFrame({"a": ["a", None, "a", None]}).select(pl.col("a").str.starts_with("a"))
q.collect(engine=pl.GPUEngine(executor="streaming", executor_options={"max_rows_per_partition": 2}))
shape: (4, 1)
┌──────┐
│ a    │
│ ---  │
│ bool │
╞══════╡
│ true │
│ null │
│ true │
│ null │
└──────┘
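
One way to picture a failure mode that would be consistent with both outputs above is a concatenation step that only keeps null information when the first partition carries it. The following is a toy model in plain Python under that assumption; it is not cudf-polars code, and `Chunk`, `concat_first_chunk_decides`, and `concat_merge_masks` are hypothetical names. The actual root cause may be different.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    """Hypothetical per-partition result: values plus an optional validity mask."""
    values: List[bool]
    valid: Optional[List[bool]]  # None means the chunk tracks no nulls


def starts_with_chunk(strings: List[Optional[str]], prefix: str) -> Chunk:
    values = [s is not None and s.startswith(prefix) for s in strings]
    has_null = any(s is None for s in strings)
    valid = [s is not None for s in strings] if has_null else None
    return Chunk(values, valid)


def concat_first_chunk_decides(chunks: List[Chunk]) -> List[Optional[bool]]:
    """Assumed buggy behavior: only the first chunk decides whether nulls are kept."""
    values = [v for c in chunks for v in c.values]
    if chunks[0].valid is None:
        return list(values)  # masks of later chunks are silently dropped
    valid = [ok for c in chunks for ok in (c.valid or [True] * len(c.values))]
    return [v if ok else None for v, ok in zip(values, valid)]


def concat_merge_masks(chunks: List[Chunk]) -> List[Optional[bool]]:
    """Expected behavior: every chunk's validity mask is honored."""
    values = [v for c in chunks for v in c.values]
    valid = [ok for c in chunks for ok in (c.valid or [True] * len(c.values))]
    return [v if ok else None for v, ok in zip(values, valid)]


# Same data and partitioning (2 rows each) as the two examples in this issue.
first = [starts_with_chunk(p, "a") for p in (["a", "b"], [None, "b"])]
second = [starts_with_chunk(p, "a") for p in (["a", None], ["a", None])]

print(concat_first_chunk_decides(first))   # [True, False, False, False]
print(concat_merge_masks(first))           # [True, False, None, False]
print(concat_first_chunk_decides(second))  # [True, None, True, None]
```

In this model the null in the second partition is lost only when the first partition has no nulls, which matches the two outputs reported here.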