python: segfault when passing RecordBatchReader to PyArrow #1523
Comments
Hmm, the problem is that the PyArrow bindings are intentionally restrictive about how you can extend them, while we want to be able to pass ADBC errors through this interface, so we have a wrapper...I suppose the wrapper needs to try harder to pretend to be a real reader. Right now, the DBAPI portion always wraps the stream in a RecordBatchReader. I think what I might try instead is to have it use the low-level methods directly, bypassing PyArrow, and if you instead request the reader, we just give you a plain PyArrow RecordBatchReader (which will not give you rich error metadata, because we can't extend this interface in Python). Or I will see if we can finagle it so that we can link to PyArrow in Cython (however, this would make us dependent on particular PyArrow versions, which I would very much like to avoid).
Hmm, and we can't monkeypatch the reader object either, otherwise that would be a viable approach :/
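The monkeypatching dead end mentioned above can be illustrated in pure Python. This is a hedged sketch: `CExtensionLike` is a stand-in, not the real `pyarrow.lib.RecordBatchReader`, but C-extension types behave the same way because their instances have no `__dict__` to accept new attributes.

```python
# Stand-in for a C-extension type such as pyarrow.lib.RecordBatchReader:
# __slots__ removes the instance __dict__, just as C-level types do.
class CExtensionLike:
    __slots__ = ("stream",)

reader = CExtensionLike()
try:
    # Attempt to monkeypatch an error-enriching read method onto the instance.
    reader.read_next_batch = lambda: None
    patched = True
except AttributeError:
    patched = False

print("monkeypatch succeeded:", patched)  # prints: monkeypatch succeeded: False
```

Because the attribute assignment raises `AttributeError`, wrapping (rather than patching) the reader is the only option from pure Python.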
As a workaround, I have just noticed that …
Cool. I can reproduce and will get a fix out for the next release. Thanks for the report!
…_batch We wrap RecordBatchReader to try to provide extra error metadata from ADBC when using this API. But PyArrow may try to extract the underlying C++ arrow::RecordBatch pointer from the reader, which won't exist in our wrapped reader. For the public API, return the unwrapped reader instead. Fixes apache#1523.
The 'fix' will just be to return …
thanks @lidavidm |
…_batch (#1530) We wrap RecordBatchReader to try to provide extra error metadata from ADBC when using this API. But PyArrow may try to extract the underlying C++ arrow::RecordBatch pointer from the reader, which won't exist in our wrapped reader. For the public API, return the unwrapped reader instead. Fixes #1523.
Hello,
I'm trying to integrate ADBC into a Flight RPC service. The typical use case is that a call (GetFlightInfo) performs a query and then the client comes to pick up the stream of results via DoGet. So on GetFlightInfo the service creates a new cursor, executes the query, and sends an appropriate FlightInfo so that DoGet with the right ticket will pick up the data. For this the code does
`cursor.fetch_record_batch()`. The resulting RecordBatchReader is then wrapped in `pyarrow.flight.RecordBatchStream` and returned. Doing this crashes the entire server with SIGSEGV. You can find the reproducer in this gist: https://gist.github.com/lupko/8b6f165a6574ef830c531c8056b20957. The reproducer skips GetFlightInfo for the sake of brevity.
Poking around the code of the ADBC Python wrappers, I think this crash happens because `AdbcRecordBatchReader` is not ready for interop with PyArrow. PyArrow's `RecordBatchReader` (`pyarrow.lib.RecordBatchReader`) has a `reader` field that contains the actual C++ RecordBatchReader. PyArrow code usually grabs the actual `reader` as soon as possible and uses it for different purposes (like getting batches to send out via Flight RPC). `AdbcRecordBatchReader` does not extend `pyarrow.lib.RecordBatchReader`, so `reader` is just not there and everything comes crashing down.