
python/postgresql: NUMERIC values corrupted in fetched results #1501

Closed
lupko opened this issue Jan 31, 2024 · 7 comments · Fixed by #1502
lupko (Contributor) commented Jan 31, 2024

Hello,

I have recently started prototyping with ADBC, with the goal of eventually integrating it into our system. Running queries and fetching results is the main use case for us.

I started with the PostgreSQL driver and have run into issues with NUMERIC columns in results. I understand the choice to transfer them losslessly as strings; however, the strings I get back are corrupted.

In my tests, I consume results as Arrow Tables.

There are two main things:

  • For a value of 0, the part before the decimal point is missing. For example:

    • with type NUMERIC(10, 4) it ends up as .0000,
    • with type NUMERIC(10, 0) it ends up as an empty string.
  • When using a particular numeric scale, all values get corrupted: there is trailing garbage that cannot be decoded as UTF-8.

Here is an all-in-one reproducer: https://gist.github.com/lupko/336da65b37aade5dc2433004e2720d8e

A sample output (NUMERIC(16, 5)):

adbc_driver_postgresql 0.9.0
PostgreSQL 16.1 on x86_64-redhat-linux-gnu, compiled by gcc (GCC) 13.2.1 20231011 (Red Hat 13.2.1-4), 64-bit
row 000: |.0000U|
row 001: |1.0000 |
row 002: |1.0100 |
row 003: |1.0120 |
row 004: |1.0123 |
row 005: |1.0123!|
row 006: |1.0123 |
row 007: |1.0123 |
row 008: |1.0123�|
row 009: |1.0123 |
row 010: |1.0123m|
row 011: |1.0123t|

Thanks for this great project. Please let me know if I can be of any further help.

lidavidm added this to the ADBC Libraries 0.10.0 milestone Jan 31, 2024
lidavidm (Member) commented:

Hmm. @WillAyd or @paleolimbot happen to have any ideas?

lupko (Contributor, Author) commented Jan 31, 2024

Also, I forgot to mention in the ticket: re-running the reproducer very often leads to different corruption, e.g. the trailing garbage changes between test runs.

paleolimbot (Member) commented:

Thank you for this and the reproducer! The code to translate Postgres COPY for NUMERIC into a string is very complicated (I translated the Postgres implementation as literally as I could):

int16_t ndigits = ReadUnsafe<int16_t>(data);
int16_t weight = ReadUnsafe<int16_t>(data);
uint16_t sign = ReadUnsafe<uint16_t>(data);
uint16_t dscale = ReadUnsafe<uint16_t>(data);

if (data->size_bytes < static_cast<int64_t>(ndigits * sizeof(int16_t))) {
  ArrowErrorSet(error,
                "Expected at least %d bytes of field data for numeric digits copy "
                "data but only %d bytes of input remain",
                static_cast<int>(ndigits * sizeof(int16_t)),
                static_cast<int>(data->size_bytes));  // NOLINT(runtime/int)
  return EINVAL;
}

digits_.clear();
for (int16_t i = 0; i < ndigits; i++) {
  digits_.push_back(ReadUnsafe<int16_t>(data));
}

// Handle special values
std::string special_value;
switch (sign) {
  case kNumericNAN:
    special_value = std::string("nan");
    break;
  case kNumericPinf:
    special_value = std::string("inf");
    break;
  case kNumericNinf:
    special_value = std::string("-inf");
    break;
  case kNumericPos:
  case kNumericNeg:
    special_value = std::string("");
    break;
  default:
    ArrowErrorSet(error,
                  "Unexpected value for sign read from Postgres numeric field: %d",
                  static_cast<int>(sign));
    return EINVAL;
}

if (!special_value.empty()) {
  NANOARROW_RETURN_NOT_OK(
      ArrowBufferAppend(data_, special_value.data(), special_value.size()));
  NANOARROW_RETURN_NOT_OK(ArrowBufferAppendInt32(offsets_, data_->size_bytes));
  return AppendValid(array);
}

// Calculate string space requirement
int64_t max_chars_required = std::max<int64_t>(1, (weight + 1) * kDecDigits);
max_chars_required += dscale + kDecDigits + 2;
NANOARROW_RETURN_NOT_OK(ArrowBufferReserve(data_, max_chars_required));
char* out0 = reinterpret_cast<char*>(data_->data + data_->size_bytes);
char* out = out0;

// Build output string in-place, starting with the negative sign
if (sign == kNumericNeg) {
  *out++ = '-';
}

// ...then digits before the decimal point
int d;
int d1;
int16_t dig;

if (weight < 0) {
  d = weight + 1;
  *out++ = '0';
} else {
  for (d = 0; d <= weight; d++) {
    if (d < ndigits) {
      dig = digits_[d];
    } else {
      dig = 0;
    }

    // To strip leading zeroes
    int append = (d > 0);

    for (const auto pow10 : {1000, 100, 10, 1}) {
      d1 = dig / pow10;
      dig -= d1 * pow10;
      append |= (d1 > 0);
      if (append) {
        *out++ = d1 + '0';
      }
    }
  }
}

// ...then the decimal point + digits after it. This may write more digits
// than specified by dscale so we need to keep track of how many we want to
// keep here.
int64_t actual_chars_required = out - out0;

if (dscale > 0) {
  *out++ = '.';
  actual_chars_required += dscale + 1;

  for (int i = 0; i < dscale; i++, d++, i += kDecDigits) {
    if (d >= 0 && d < ndigits) {
      dig = digits_[d];
    } else {
      dig = 0;
    }

    for (const auto pow10 : {1000, 100, 10, 1}) {
      d1 = dig / pow10;
      dig -= d1 * pow10;
      *out++ = d1 + '0';
    }
  }
}

// Update data buffer size and add offsets
data_->size_bytes += actual_chars_required;
NANOARROW_RETURN_NOT_OK(ArrowBufferAppendInt32(offsets_, data_->size_bytes));
return AppendValid(array);

I am guessing there is an error in there. I can take a look in the next few days!

paleolimbot (Member) commented:

My guess is that the value of actual_chars_required is not always correct in our implementation!
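
If that is the case, the failure mode would look roughly like the following standalone sketch (a hypothetical illustration, not the driver code): the size bookkeeping advances past what was actually written, so reserved-but-unwritten buffer bytes end up inside the string value.

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Minimal sketch (not the driver code): write 6 characters into a reserved
// buffer, but advance the logical size by 7 as if all of dscale had been
// written. The extra byte was never initialized, so whatever happens to be
// in the buffer becomes the trailing "garbage" byte of the string value.
int main() {
  std::vector<char> buffer(16, '\xCC');  // stand-in for stale reserved bytes
  char* out0 = buffer.data();
  char* out = out0;
  for (char c : std::string("1.0123")) *out++ = c;  // 6 bytes actually written

  int64_t actual_chars_required = 7;  // bookkeeping claims one byte too many
  std::string value(out0, static_cast<std::size_t>(actual_chars_required));

  assert(value.size() == 7);
  assert(value.back() == '\xCC');  // the stray byte seen in the reproducer output
  return 0;
}

That would also be consistent with the trailing garbage changing between test runs, since those bytes are whatever happened to already be in the buffer.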

lupko (Contributor, Author) commented Jan 31, 2024

@paleolimbot I have some free time, so I'm poking around the code and have started comparing the funky code with the original from PostgreSQL (it's a bit late here so I'm on sort of autopilot :))

I think the missing leading zero is due to this: maint-0.9.0...lupko:arrow-adbc:maint-0.9.0#diff-c5dc283c26154e7a7e1291b73f5131cf4a180240729fea7f459980f60b08486cR438

It seems to be a clear deviation from the original code.

Trying locally with those changes, I see the leading zero is restored.
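
For comparison, here is a rough, self-contained paraphrase of how the upstream get_str_from_var (the DEC_DIGITS = 4 variant) emits the integer part of a base-10000 digit group; it is a sketch for illustration, not the upstream code or the ADBC fix verbatim. The difference from the driver code quoted above is that the ones digit of each group is written unconditionally, so a zero value still renders a '0' before the decimal point:

#include <cstdint>
#include <cstdio>
#include <initializer_list>
#include <string>

// Sketch modeled on PostgreSQL's get_str_from_var (DEC_DIGITS = 4); not the
// exact upstream or ADBC code. Leading-zero suppression applies only to the
// first three decimal digits of the first group -- the ones digit is always
// emitted, which is what keeps the lone '0' in front of the decimal point.
static void AppendIntegerGroup(std::string& out, int16_t dig, bool first_group) {
  bool putit = !first_group;
  for (int pow10 : {1000, 100, 10}) {
    int d1 = dig / pow10;
    dig -= static_cast<int16_t>(d1 * pow10);
    putit = putit || (d1 > 0);
    if (putit) out.push_back(static_cast<char>('0' + d1));
  }
  out.push_back(static_cast<char>('0' + dig));  // always emitted
}

int main() {
  std::string s;
  AppendIntegerGroup(s, 0, /*first_group=*/true);  // a NUMERIC zero: single group of 0
  std::printf("%s\n", s.c_str());                  // prints "0", not an empty string
}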

lupko (Contributor, Author) commented Jan 31, 2024

The trailing corruption is likely due to this: lupko@f171a17
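
For reference, the fractional part in the upstream get_str_from_var advances the group index by DEC_DIGITS once per iteration and then truncates back to exactly dscale digits. A rough, self-contained paraphrase (a sketch for illustration, not the upstream code or the ADBC fix verbatim):

#include <cstdint>
#include <initializer_list>
#include <string>
#include <vector>

// Sketch modeled on PostgreSQL's get_str_from_var (DEC_DIGITS = 4); not the
// exact upstream or ADBC code. Two things differ from the driver loop quoted
// earlier: (1) `i` advances by kDecDigits only, once per group, rather than
// also being incremented with i++, and (2) the output is truncated back to
// exactly dscale fractional digits afterwards.
static void AppendFraction(std::string& out, const std::vector<int16_t>& digits,
                           int d, int dscale) {
  constexpr int kDecDigits = 4;
  out.push_back('.');
  const size_t end = out.size() + static_cast<size_t>(dscale);
  for (int i = 0; i < dscale; d++, i += kDecDigits) {
    int16_t dig = (d >= 0 && d < static_cast<int>(digits.size())) ? digits[d] : 0;
    for (int pow10 : {1000, 100, 10, 1}) {
      int d1 = dig / pow10;
      dig -= static_cast<int16_t>(d1 * pow10);
      out.push_back(static_cast<char>('0' + d1));
    }
  }
  out.resize(end);  // drop any digits written past dscale
}

int main() {
  // e.g. the digit groups for 1.0123 with dscale = 5: integer group {1},
  // fractional group {123}; d starts just past the integer part.
  std::string s = "1";
  AppendFraction(s, {1, 123}, /*d=*/1, /*dscale=*/5);
  // s is now "1.01230": five fractional digits, no stray byte
  return 0;
}

With the driver loop quoted earlier, the same input emits only four fractional digits while actual_chars_required still accounts for dscale + 1 characters, which matches the "1.0123" plus one garbage byte in the sample output above.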

lupko (Contributor, Author) commented Jan 31, 2024

Here is a PR that makes the conversion faithful to the PostgreSQL counterpart: https://github.com/apache/arrow-adbc/pull/1502/files

Local testing shows the single zero before the decimal point is now present and the corruption is gone.

lidavidm pushed a commit that referenced this issue Feb 1, 2024
- The conversion from numeric to string had two bugs due to deviations from the original PostgreSQL code:
  - the single leading zero before the decimal point would always be dropped
  - in some cases, the digits after the decimal point would be incomplete, with the remainder replaced by 'garbage'

here is the PostgreSQL code:
https://github.com/postgres/postgres/blob/9589b038d3203cd5ba708fb4f5c23182c88ad0b3/src/backend/utils/adt/numeric.c#L7443

(the DEC_DIGITS=4 variant)

Fixes #1501.