-
Notifications
You must be signed in to change notification settings - Fork 952
Store polars Series instead of pyarrow Array in cudf_polars LiteralColumn expr #18564
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Store polars Series instead of pyarrow Array in cudf_polars LiteralColumn expr #18564
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
…m__` (#18712) closes #18573 unblocks #18564 Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) URL: #18712
Closes #18745 and Closes #18819 and contributes to #18534 #15132. I replaced as many pyarrow calls as I could with features supported in this PR. The only notable case I skipped is conversions to pylibcudf Columns from pyarrow string arrays. To support creating pylibcudf columns from string lists, I think we'd need #17192. Nevertheless that case will be covered by #18564 using `pl.Series` instead of `pa.array`. In the future, when we support #17192, we can use `plc.Column.from_list([...])` instead of `plc.Column(pl.Series(...))`. Authors: - Matthew Murray (https://github.com/Matt711) Approvers: - Matthew Roeschke (https://github.com/mroeschke) URL: #18768
Follow up to #18768. Addresses this comment #18768 (comment). Also contributes to #18534 #15132 This PR adds support for creating pylibcudf Column from Python iterables containing strings and nested lists of strings. Should also unblock #18564 Authors: - Matthew Murray (https://github.com/Matt711) Approvers: - Matthew Roeschke (https://github.com/mroeschke) URL: #18916
@@ -158,6 +158,10 @@ def pytest_configure(config: pytest.Config) -> None: | |||
"tests/unit/io/test_scan.py::test_async_read_21945[scan_type2]": "Debug output on stderr doesn't match", | |||
"tests/unit/io/test_scan.py::test_async_read_21945[scan_type3]": "Debug output on stderr doesn't match", | |||
"tests/unit/io/test_multiscan.py::test_multiscan_row_index[scan_csv-write_csv-csv]": "Debug output on stderr doesn't match", | |||
"tests/unit/functions/range/test_linear_space.py::test_linear_space_date": "Needs https://github.com/pola-rs/polars/issues/23020", | |||
"tests/unit/sql/test_temporal.py::test_implicit_temporal_strings[dt IN ('1960-01-07','2077-01-01','2222-02-22')-expected15]": "Needs https://github.com/pola-rs/polars/issues/23020", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the same issue as https://github.com/pola-rs/polars/issues/21660
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, that was linked in the issue I created here.
if haystack.obj.type().id() == plc.TypeId.LIST: | ||
# Unwrap values from the list column | ||
haystack = Column(haystack.obj.children()[1]) | ||
# TODO: Remove check once Column's require dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noting #19193
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully this PR goes in first and then we can simply remove this check in #19193.
if haystack.obj.type().id() == plc.TypeId.LIST: | ||
# Unwrap values from the list column | ||
haystack = Column(haystack.obj.children()[1]) | ||
# TODO: Remove check once Column's require dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully this PR goes in first and then we can simply remove this check in #19193.
# to a expr.LiteralColumn, so the actual type is in the inner type | ||
plc_dtype = DataType(haystack.dtype.polars.inner).plc | ||
else: | ||
plc_dtype = haystack.dtype.plc # pragma: no cover |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a test for this case? Is it difficult to test for some reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The conditions where we hit this is covered by the Polars unit tests, but I can try to replicate this in our own unit tests in #19193
) | ||
|
||
|
||
def downcast_arrow_lists(typ: pa.DataType) -> pa.DataType: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love to see this going away.
/merge |
…olars (#19198) This PR and #18564 are the last PRs removing `pylibcudf.interop.to_arrow` in `cudf_polars`. Towards #18534 Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) URL: #19198
Description
Contributes to #18534
Depends on:
pylibcudf.Column
accepting objects with__arrow_c_stream__
Checklist