
feat(c/driver/postgresql): optionally return NUMERIC columns as double #1513

Open

lupko opened this issue Feb 5, 2024 · 9 comments

lupko (Contributor) commented Feb 5, 2024

Hi all,

I am currently evaluating/trying out ADBC (Python) in the following context:

  • There is a Flight RPC data service - let's call it the 'connector' - that is responsible for handling arbitrary SQL queries against PostgreSQL and producing the results as a stream of Arrow data.
  • The result provided by the 'connector' is then used by other services that do additional 'post-SQL' processing; it is further 'massaged' by different algorithms (typically aggregations or ML algorithms).
  • The PostgreSQL data stored in NUMERIC columns typically represents some measurable value, which feeds into the algorithms applied by the other services.
  • It is a hard constraint that the results produced by the 'connector' can be used as-is in the subsequent steps.

In this context, the current ADBC PostgreSQL behavior of converting all NUMERIC columns to strings is not ideal. I understand the motivation - ADBC does not want to do a lossy conversion by default, and string is the only safe bet at the moment - but I need to find a way out :)

I was entertaining the idea that the 'connector' talking to PostgreSQL would perform the conversion on the fly so that, in the end, NUMERICs come out as doubles (e.g. pipe the record batches produced by ADBC through Acero to perform the string-to-double conversion on the way out of the service).
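To make the idea a bit more concrete, below is a minimal sketch of that post-processing step, using Arrow C++'s Cast compute function directly rather than a full Acero plan; the function name NumericStringsToDouble and the explicit column index are purely illustrative:

```cpp
#include <memory>
#include <vector>

#include <arrow/api.h>
#include <arrow/compute/api.h>

// Illustrative only: cast one string column (a NUMERIC rendered as text by the
// driver) to float64 and rebuild the record batch around it.
arrow::Result<std::shared_ptr<arrow::RecordBatch>> NumericStringsToDouble(
    const std::shared_ptr<arrow::RecordBatch>& batch, int column_index) {
  ARROW_ASSIGN_OR_RAISE(
      arrow::Datum casted,
      arrow::compute::Cast(batch->column(column_index), arrow::float64()));
  ARROW_ASSIGN_OR_RAISE(
      auto schema,
      batch->schema()->SetField(
          column_index,
          arrow::field(batch->schema()->field(column_index)->name(),
                       arrow::float64())));
  std::vector<std::shared_ptr<arrow::Array>> columns = batch->columns();
  columns[column_index] = casted.make_array();
  return arrow::RecordBatch::Make(std::move(schema), batch->num_rows(),
                                  std::move(columns));
}
```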

The tricky part is that I'm unable to make this work in a simple, deterministic way with just the SQL 'in hand': the ADBC result gives no indication (or I was unable to find it - please let me know) that a string column in the result was in fact created from NUMERIC data.


The ideal solution in my context would be for ADBC PostgreSQL to return NUMERIC columns as double. I read through the issues and found that you have been considering this as optional behavior.

I dug into this a little and found in the PostgreSQL source that it does the numeric -> double conversion by calling strtod on the string representation (i.e. using the algorithm that is already ported to ADBC's PostgreSQL driver). So if ADBC were to follow suit, it could reuse the existing numeric -> string conversion and then run strtod on top of it - it could be quite simple.
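A rough sketch of what that two-step approach could look like (this is not the driver's actual code; the helper name is made up, and note that strtod honours the current locale's decimal separator):

```cpp
#include <cerrno>
#include <cstdlib>
#include <string>

// Hypothetical helper: take the decimal text the driver already knows how to
// build from the COPY representation of NUMERIC and turn it into a double.
bool NumericTextToDouble(const std::string& numeric_text, double* out) {
  errno = 0;
  char* end = nullptr;
  const double value = std::strtod(numeric_text.c_str(), &end);
  if (end == numeric_text.c_str() || *end != '\0') {
    return false;  // nothing parsed, or trailing garbage
  }
  if (errno == ERANGE) {
    return false;  // overflow/underflow; caller decides how to report it
  }
  *out = value;
  return true;
}
```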

For reference, here is a working prototype: lupko@6d52952

  • refactored the reader so that the numeric -> string code & friends ported from PostgreSQL live in a common base class
  • there is one class for numeric-to-string; as before, the code populates the Arrow buffers directly
  • there is another class for numeric-to-double; the string representation is written into a temporary location, from which the code runs strtod and populates the Arrow buffer with the double value

What is missing from the prototype is the logic that decides whether to produce string or double - the prototype code always produces double.

If you find this general direction acceptable, I could take it further and complete it. I just need a little extra guidance about making this optional. At what level should the setting be available? What should its name be? What is the best way to 'tunnel' the option down to the code that decides which Arrow type to use? (PostgresType::SetSchema seems like a good place - a simple conditional there would be enough; no need to add extra conditions to adbcpq::MakeCopyFieldReader.)

lidavidm added this to the ADBC Libraries 0.10.0 milestone Feb 5, 2024
lidavidm (Member) commented Feb 5, 2024

A statement-level option seems good. Maybe @paleolimbot or @WillAyd would have more of an opinion on how to structure the code. We may want to consider doing the double conversion directly at some point instead of bouncing through a string.

We could consider a separate issue to add the postgres type specifier to the result schema so that you could identify cases like this?

paleolimbot (Member) commented

Implementation-wise, the shortest path is probably to parse the decimal text with from_chars() (which I think is locale-independent). The representation we get from COPY is also not bad to convert (it's groups of base-10000 digits, each from 0 to 9999); however, one would have to be careful about, and test, over/underflow cases.
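For illustration, a minimal sketch of that parsing step with std::from_chars (C++17 <charconv>; the floating-point overloads need a reasonably recent standard library), including the over/underflow check:

```cpp
#include <charconv>
#include <string_view>
#include <system_error>

// Parse the decimal text of a NUMERIC value into a double, locale-independently.
bool ParseNumericText(std::string_view text, double* out) {
  const auto result =
      std::from_chars(text.data(), text.data() + text.size(), *out);
  if (result.ec == std::errc::result_out_of_range) {
    return false;  // over/underflow: caller decides whether to error or saturate
  }
  return result.ec == std::errc() && result.ptr == text.data() + text.size();
}
```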

For types that we don't know how to deal with, we already attach the database type name as field metadata... we could do that for all types (or at least for types like decimal and array where we know in advance that what we return is ambiguous). As you noted, the client could then handle the conversion as a workaround.
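As a sketch of what that client-side check could look like (the metadata key below is an assumption for illustration, not necessarily the exact key the driver writes):

```cpp
#include <memory>

#include <arrow/api.h>

// Assumed metadata key, for illustration only; check the driver for the real one.
constexpr const char* kPostgresTypeKey = "ADBC:postgresql:typname";

// Returns true if the field carries metadata saying it originated as NUMERIC.
bool FieldCameFromNumeric(const std::shared_ptr<arrow::Field>& field) {
  const auto& metadata = field->metadata();
  if (metadata == nullptr) return false;
  const int index = metadata->FindKey(kPostgresTypeKey);
  return index >= 0 && metadata->value(index) == "numeric";
}
```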

Interface-wise, I hope there is a way we can avoid AdbcStatementSetOption("adbc.statement.XXX_as_XXX", 1). There are a lot of options regarding how to calculate an Arrow type from a database type, and it's easy to accumulate many of these. I wish there were an AdbcStatementRequestSchema() to handle the case where the client already knows what they want (perhaps after inspecting AdbcStatementExecuteSchema()); however, I don't think there's a clean way to handle that with the existing API. It would fit nicely with things like the Python capsule protocol (__arrow_c_stream__(requested_schema=xxx)).

lupko (Contributor, Author) commented Feb 5, 2024

Indeed, having the postgres type specifier also occurred to me, but I did not notice it is already done for some other types, so I went with the conversion first. I'll create a separate issue & PR for that.

IMHO, in the end it could make sense to have this extra source-type info every time a concrete type is converted to a more generic one, or there is a loss of semantics. Numeric-to-string is one instance, but there is also, for instance, OID -> int32 and a few others. WDYT?

lupko (Contributor, Author) commented Feb 5, 2024

@paleolimbot good point on from_chars(); I was not aware that strtod() uses the current locale. Since from_chars() is available from C++17 onward, I guess this would have to wait for #1431?

WillAyd (Contributor) commented Feb 5, 2024

@lupko you can already start using C++17

lidavidm (Member) commented Feb 5, 2024

@paleolimbot want to add something to this milestone? https://github.com/apache/arrow-adbc/milestone/8

lupko changed the title from "feature(c/driver/postgresql): optionally return NUMERIC columns as double" to "feat(c/driver/postgresql): optionally return NUMERIC columns as double" Feb 5, 2024
lupko (Contributor, Author) commented Feb 6, 2024

I have been poking at this some more and explored making the conversion optional. Alas, not exactly as described in #1514, because that is just too advanced for me to do in a reasonable time :(.

Anyway, I have created a draft PR for you to consider. In this draft there is a new statement-level option, adbc.postgresql.numeric_conversion, whose value can be to_string or to_double, with the idea that if one day ADBC PostgreSQL supports (limited) conversion to decimal128/256, the value could become to_decimal or whatnot. That way the number of possible option keys stays limited.
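For reference, using the proposed option through the ADBC C API would look roughly like this (the option name and values come from the draft PR and may still change):

```cpp
#include <adbc.h>

// 'statement' is assumed to be an already-initialized AdbcStatement on a
// PostgreSQL connection; error handling is left to the caller.
static AdbcStatusCode RequestNumericAsDouble(struct AdbcStatement* statement,
                                             struct AdbcError* error) {
  return AdbcStatementSetOption(statement, "adbc.postgresql.numeric_conversion",
                                "to_double", error);
}
```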

I think an option like this still makes sense regardless of what is proposed in #1514. My (twisted :)) reasoning goes as follows: PostgreSQL NUMERIC cannot always be mapped 1:1 to an Arrow type, so the driver has to compromise and fall back to some kind of conversion when creating Arrow data. There can be multiple strategies for this fallback, and the driver allows simple configuration of that strategy. In other words, this is not about configuring the type mapping, but about configuring the fallback when a 'native' mapping for some type is just not possible.

The option-based approach would not conflict with the fancier stuff proposed in #1514 - a schema with the desired types always wins; without one, the driver falls back to the option.

Anyway, here is the PR draft: #1521

I understand if this is not the way you want to go - no hard feelings; we can just close this ticket in favor of #1514 :). With the extra field-level metadata that is already in place, I can build a workaround in client code already (but I was hoping to avoid it).

paleolimbot (Member) commented

I can definitely see how an option (or options) could be a workaround... the solution I suggested currently has an indeterminate timeline (it requires a new ABI version), and I'm definitely not opposed to solutions that are possible today. Would the metadata solution (i.e., pushing the workaround into non-ADBC client code) also work, or does it have performance/workflow implications that would make it unusable?

lidavidm (Member) commented

@lupko is the current approach with the type metadata sufficient?
