Require dtype argument to cudf_polars Column container #19193

Merged

Conversation

mroeschke
Contributor

Description

Depends on #19075

Following #19091, this PR ensures the Column always contains a DataType object so that Polars type metadata, such as struct field names, is preserved.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

mroeschke added 21 commits June 2, 2025 16:51
@mroeschke mroeschke self-assigned this Jun 17, 2025
@mroeschke mroeschke requested a review from a team as a code owner June 17, 2025 23:30
@mroeschke mroeschke added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jun 17, 2025
@github-actions github-actions bot added Python Affects Python cuDF API. cudf-polars Issues specific to cudf-polars labels Jun 17, 2025
@GPUtester GPUtester moved this to In Progress in cuDF Python Jun 17, 2025
if dtype_str.startswith("list["):
    stripped = dtype_str.removeprefix("list[").removesuffix("]")
    return pl.List(_dtype_short_repr_to_dtype(stripped))
return pl.datatypes.convert.dtype_short_repr_to_dtype(dtype_str)
Contributor

Question about the return type here: I see that https://github.com/pola-rs/polars/blob/405c371c41d71e4463829062ba297e3378cfd85d/py-polars/polars/datatypes/convert.py returns a PolarsDataType | None.

  1. How should we handle the None case (which seems to happen for invalid data types)?
  2. PolarsDataType is defined as Union["DataTypeClass", "DataType"]. Do we need to worry about the DataTypeClass variant? I'm not exactly sure what this is.

Contributor Author

@mroeschke mroeschke Jun 18, 2025

> How should we handle the None case (which seems to happen for invalid data types)?

If we allowed Columns to have an optional (None) dtype, I suppose a None return here could still be valid, but that would hide the root cause of an invalid data type.

I suppose we can raise/add an assert here, but would it be better to do that during deserialization or before serialization?

> Do we need to worry about the DataTypeClass variant?

Good catch. Yes, it appears this method can return a polars.DataType class (DataTypeClass) and not necessarily an instance, which is what our DataType container expects.
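
To illustrate the class-versus-instance distinction, here is a minimal sketch. The `Int64` and `Datetime` classes and the `ensure_instance` helper are hypothetical stand-ins so the example runs without Polars installed; they are not the real Polars types or any cudf_polars API.

```python
import inspect


class Int64:
    """Hypothetical stand-in for a dtype with no parameters."""


class Datetime:
    """Hypothetical stand-in for a dtype whose parameters live on the instance."""

    def __init__(self, time_unit="us"):
        self.time_unit = time_unit


def ensure_instance(dtype):
    """Return an instance: call a bare class, pass an existing instance through."""
    return dtype() if inspect.isclass(dtype) else dtype


bare_class = ensure_instance(Int64)               # class in, instance out
configured = ensure_instance(Datetime("ms"))      # instance in, same instance out
```

An instance that already carries parameters (here `time_unit`) passes through unchanged, so only bare classes are instantiated.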

Contributor

> I suppose we can raise/add an assert here, but would it be better to do that during deserialization or before serialization?

I think raising an error here (in the deserialization) is appropriate. If we're unable to parse a dtype then presumably something has gone very wrong.
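
A rough sketch of what failing loudly during deserialization could look like, including the recursive `list[...]` handling from the snippet above. The `_SHORT_REPRS` table and `lookup_short_repr` are hypothetical stand-ins for the Polars parser (which, per the discussion, may return None), and plain strings and tuples stand in for real dtype objects.

```python
# Hypothetical lookup table standing in for the Polars short-repr parser.
_SHORT_REPRS = {"i64": "Int64", "f64": "Float64", "str": "String"}


def lookup_short_repr(dtype_str):
    """Return a dtype name, or None when the string is not recognized."""
    return _SHORT_REPRS.get(dtype_str)


def parse_dtype(dtype_str):
    """Parse a short repr, recursing into list[...] and raising on failure."""
    if dtype_str.startswith("list[") and dtype_str.endswith("]"):
        inner = dtype_str.removeprefix("list[").removesuffix("]")
        return ("List", parse_dtype(inner))
    parsed = lookup_short_repr(dtype_str)
    if parsed is None:
        # Surface the bad input here instead of passing None downstream.
        raise ValueError(f"{dtype_str} was not able to be parsed.")
    return parsed


nested = parse_dtype("list[list[i64]]")
```

Raising at this point keeps the unparseable string in the error message, which is harder to recover once a None has propagated into a Column.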

Contributor

FWIW, I was wondering why our CI didn't catch these. We configure mypy to follow imports for polars (instead of just using Any).

I think it's because we don't include polars as a dependency in our pre-commit config:

additional_dependencies: [types-cachetools]

@mroeschke mroeschke requested a review from TomAugspurger June 18, 2025 22:22
Contributor

@TomAugspurger TomAugspurger left a comment

Looks nice overall. One question about some of the dtype deserialization logic.

I'll follow up with a PR ensuring that we have sufficient test coverage for the serialization.

pl_type = pl.datatypes.convert.dtype_short_repr_to_dtype(dtype_str)
if pl_type is None:
    raise ValueError(f"{dtype_str} was not able to be parsed by Polars.")
return pl_type() if inspect.isclass(pl_type) else pl_type
Contributor

How safe is pl_type(), without any arguments, here? Some types (Array, Enum) require additional arguments. Maybe we don't support those yet?

Contributor Author

For the types we support in DataType, I believe this is fairly safe as I'm hoping that dtype_short_repr_to_dtype will return instances for types with parameters (polars.Datetime and polars.Duration).

Types that take arguments and that we don't support should be rejected when constructing a DataType.
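
As a sketch of the failure mode raised in this thread: calling a class with no arguments fails when its constructor has required parameters. The `Int32` and `Array` classes and the `instantiate_or_reject` helper below are hypothetical stand-ins, not the Polars types or the merged implementation; the merged code relies on DataType construction to reject unsupported types rather than on a guard like this one.

```python
import inspect


class Int32:
    """Hypothetical dtype with no required parameters."""


class Array:
    """Hypothetical dtype whose constructor requires arguments."""

    def __init__(self, inner, shape):
        self.inner = inner
        self.shape = shape


def instantiate_or_reject(dtype):
    """Instantiate a bare class, turning a missing-argument failure into a clear error."""
    if not inspect.isclass(dtype):
        return dtype
    try:
        return dtype()
    except TypeError as err:
        raise ValueError(
            f"{dtype.__name__} requires arguments and cannot be built from its class alone"
        ) from err


simple = instantiate_or_reject(Int32)
```

Passing `Array` to this helper raises a ValueError that names the offending type, instead of a bare TypeError about missing positional arguments.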

@mroeschke
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit c838f81 into rapidsai:branch-25.08 Jun 25, 2025
91 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in cuDF Python Jun 25, 2025
@mroeschke mroeschke deleted the ref/cudf_polars/from_table_types branch June 25, 2025 22:03
rapids-bot bot pushed a commit that referenced this pull request Jun 26, 2025
With #19193 and this PR, we will no longer import `pyarrow` explicitly in `cudf_polars`. xref #18534

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)

Approvers:
  - Matthew Murray (https://github.com/Matt711)

URL: #19219
Labels
cudf-polars Issues specific to cudf-polars improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: Done
2 participants