-
Notifications
You must be signed in to change notification settings - Fork 952
Require dtype argument to cudf_polars Column
container
#19193
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Require dtype argument to cudf_polars Column
container
#19193
Conversation
…olars/struct_expr
…olars/struct_expr
…olars/struct_expr
…eschke/cudf into feat/cudf_polars/struct_expr
…olars/struct_expr
…olars/struct_expr
…olars/struct_expr
…olars/struct_expr
if dtype_str.startswith("list["): | ||
stripped = dtype_str.removeprefix("list[").removesuffix("]") | ||
return pl.List(_dtype_short_repr_to_dtype(stripped)) | ||
return pl.datatypes.convert.dtype_short_repr_to_dtype(dtype_str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question about the return type here: I see that https://github.com/pola-rs/polars/blob/405c371c41d71e4463829062ba297e3378cfd85d/py-polars/polars/datatypes/convert.py returns a PolarsDataType | None
.
- How should we handle the
None
case (which seems to happen for invalid data types)? PolarsType
is defined asUnion["DataTypeClass", "DataType"]
. Do we need to worry about theDataTypeClass
variant ? I'm not exactly sure what this is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How should we handle the None case (which seems to happen for invalid data types)?
If we allow Column
s to have an optional, None
, dtype I suppose a None
return here could still be valid, but that would hide the root cause of an invalid data type
I suppose we can raise/add an assert
here, but would it be better to do that during deserialization or before serialization?
Do we need to worry about the DataTypeClass variant ?
Good catch. Yes, it appears this method can return a polars.DataType
class (DataTypeClass
) and not necessarily an instance which is what our DataType
container expects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose we can raise/add an assert here, but would it be better to do that during deserialization or before serialization?
I think raising an error here (in the deserialization) is appropriate. If we're unable to parse a dtype then presumably something has gone very wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks nice overall. One question about some of the dtype deserialization logic.
I'll follow up with a PR ensuring that we have sufficient test coverage for the serialization.
pl_type = pl.datatypes.convert.dtype_short_repr_to_dtype(dtype_str) | ||
if pl_type is None: | ||
raise ValueError(f"{dtype_str} was not able to be parsed by Polars.") | ||
return pl_type() if inspect.isclass(pl_type) else pl_type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How safe is pl_type()
, without any arguments, here? Some types (Array
, Enum
) require additional arguments. Maybe we don't support those yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the types we support in DataType
, I believe this is fairly safe as I'm hoping that dtype_short_repr_to_dtype
will return instances for types with parameters (polars.Datetime
and polars.Duration
).
For those types that we don't support that take arguments, those should be rejected when constructing a DataType
/merge |
With #19193 and this PR, we'll not import `pyarrow` explicitly in `cudf_polars` xref #18534 Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Matthew Murray (https://github.com/Matt711) URL: #19219
Description
Depends on #19075
Following #19091, this PR ensure the
Column
always contains aDataType
object such that Polars type metadata such as struct field names are preservedChecklist