Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

python: segfault when passing RecordBatchReader to PyArrow #1523

Closed
lupko opened this issue Feb 7, 2024 · 6 comments · Fixed by #1530
Closed

python: segfault when passing RecordBatchReader to PyArrow #1523

lupko opened this issue Feb 7, 2024 · 6 comments · Fixed by #1530
Assignees

Comments

@lupko
Copy link
Contributor

lupko commented Feb 7, 2024

Hello,

I'm trying to integrate ADBC into a Flight RPC service. Typical use case is that a call (GetFlightInfo) performs a query and then client comes to pick stream of results via DoGet. So on GetFlightInfo service creates new cursor, executes and sends appropriate FlightInfo so that DoGet with the right ticket will pick up the data. For this the code does cursor.fetch_record_batch(). The resulting RecordBatchReader is then wrapped into pyarrow.flight.RecordBatchStream and returned.

Doing this crashes the entire server with SIGSEGV. You can find the reproducer in this gist: https://gist.github.com/lupko/8b6f165a6574ef830c531c8056b20957. The reproducer skips the GetFlightInfo for sakes of brevity.

Poking around the code of ADBC python wrappers, I think this crash happens because AdbcRecordBatchReader is not ready for interop with PyArrow .

The PyArrow's RecordBatchReader (pyarrow.lib.RecordBatchReader) has reader field that contains the actual C++ RecordBatchReader. PyArrow code usually grabs the actual reader as soon as possible and uses it for the different purposes (like getting batches to send out via Flight RPC).

The AdbcRecordBatchReader does not extend pyarrow.lib.RecordBatchReader and so reader is just not there and everything comes crashing down.

@lidavidm lidavidm self-assigned this Feb 7, 2024
@lidavidm lidavidm added this to the ADBC Libraries 0.10.0 milestone Feb 7, 2024
@lidavidm
Copy link
Member

lidavidm commented Feb 7, 2024

Hmm, the problem is that the PyArrow bindings are intentionally restrictive about how you can extend them, while we want to be able to pass on ADBC errors through this interface, so we have a wrapper...I suppose the wrapper needs to try harder to pretend to be a real reader.

Right now, the DBAPI portion always wraps the stream in a RecordBatchReader. I think what I might try doing instead is have it use the low-level methods directly bypassing PyArrow, and if you instead request the reader, we just give you a plain PyArrow RecordBatchReader (which will not give you rich error metadata because we can't extend this interface in Python). Or I will see if we can finagle it so that we can link to PyArrow in Cython (however, this will mean we're dependent on particular PyArrow versions which I would very much like to avoid).

@lidavidm
Copy link
Member

lidavidm commented Feb 7, 2024

Hmm, and we can't monkeypatch the reader object either, otherwise that would be a viable method :/

@lupko
Copy link
Contributor Author

lupko commented Feb 7, 2024

As a workaround, I have just noticed that AdbcRecordBatchReader._import_from_c actually uses PyArrow's RBR and keeps it in _reader. If i go dirty and use that fetch_record_batch()._reader then all works as expected. Indeed, not the ideal way to go but it unblocks my progress a bit.

@lidavidm
Copy link
Member

lidavidm commented Feb 7, 2024

Cool. I can reproduce and will get a fix out for the next release. Thanks for the report!

lidavidm added a commit to lidavidm/arrow-adbc that referenced this issue Feb 8, 2024
…_batch

We wrap RecordBatchReader to try to provide extra error metadata from
ADBC when using this API.  But PyArrow may try to extract the underlying
C++ arrow::RecordBatch pointer from the reader, which won't exist in our
wrapped reader.  For the public API, return the unwrapped reader
instead.

Fixes apache#1523.
@lidavidm
Copy link
Member

lidavidm commented Feb 8, 2024

The 'fix' will just be to return _reader for you; I've discussed some potential fancier fixes with PyArrow maintainers but this will work for the time being.

@lupko
Copy link
Contributor Author

lupko commented Feb 8, 2024

thanks @lidavidm

lidavidm added a commit that referenced this issue Feb 8, 2024
…_batch (#1530)

We wrap RecordBatchReader to try to provide extra error metadata from
ADBC when using this API. But PyArrow may try to extract the underlying
C++ arrow::RecordBatch pointer from the reader, which won't exist in our
wrapped reader. For the public API, return the unwrapped reader instead.

Fixes #1523.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants