
python/postgresql: NUMERIC values corrupted in fetched results #1501

Closed
lupko opened this issue Jan 31, 2024 · 7 comments · Fixed by #1502
lupko (Contributor) commented Jan 31, 2024

Hello,

I have recently started prototyping with ADBC, with the goal of eventually integrating it into our system. Running queries and fetching results is the main use case for us.

I started with the PostgreSQL driver and have run into issues with NUMERIC columns in results. I understand the choice to transfer them losslessly as strings; however, the strings I get back are corrupted.

In my tests, I consume results as Arrow Tables.

There are two main things:

  • For a value of 0, the part before the decimal point is missing. For example:

    • with type NUMERIC(10, 4) it ends up as .0000,
    • with type NUMERIC(10, 0) it ends up as an empty string.
  • When using a particular numeric scale, all values get corrupted: there is trailing garbage that cannot be decoded as UTF-8.

Here is an all-in-one reproducer: https://gist.github.com/lupko/336da65b37aade5dc2433004e2720d8e

A sample output (NUMERIC(16, 5)):

adbc_driver_postgresql 0.9.0
PostgreSQL 16.1 on x86_64-redhat-linux-gnu, compiled by gcc (GCC) 13.2.1 20231011 (Red Hat 13.2.1-4), 64-bit
row 000: |.0000U|
row 001: |1.0000 |
row 002: |1.0100 |
row 003: |1.0120 |
row 004: |1.0123 |
row 005: |1.0123!|
row 006: |1.0123 |
row 007: |1.0123 |
row 008: |1.0123�|
row 009: |1.0123 |
row 010: |1.0123m|
row 011: |1.0123t|

Thanks for this great project. Please let me know if I can be of any further help.

lidavidm added this to the ADBC Libraries 0.10.0 milestone Jan 31, 2024
lidavidm (Member) commented:

Hmm. @WillAyd or @paleolimbot happen to have any ideas?

lupko (Contributor, Author) commented Jan 31, 2024

Also, I forgot to mention in the ticket: re-running the reproducer very often leads to different corruption, e.g. the trailing garbage changes between test runs.

paleolimbot (Member) commented:

Thank you for this and the reproducer! The code to translate Postgres COPY for NUMERIC into a string is very complicated (I translated the Postgres implementation as literally as I could):

int16_t ndigits = ReadUnsafe<int16_t>(data);
int16_t weight = ReadUnsafe<int16_t>(data);
uint16_t sign = ReadUnsafe<uint16_t>(data);
uint16_t dscale = ReadUnsafe<uint16_t>(data);

if (data->size_bytes < static_cast<int64_t>(ndigits * sizeof(int16_t))) {
  ArrowErrorSet(error,
                "Expected at least %d bytes of field data for numeric digits copy "
                "data but only %d bytes of input remain",
                static_cast<int>(ndigits * sizeof(int16_t)),
                static_cast<int>(data->size_bytes));  // NOLINT(runtime/int)
  return EINVAL;
}

digits_.clear();
for (int16_t i = 0; i < ndigits; i++) {
  digits_.push_back(ReadUnsafe<int16_t>(data));
}

// Handle special values
std::string special_value;
switch (sign) {
  case kNumericNAN:
    special_value = std::string("nan");
    break;
  case kNumericPinf:
    special_value = std::string("inf");
    break;
  case kNumericNinf:
    special_value = std::string("-inf");
    break;
  case kNumericPos:
  case kNumericNeg:
    special_value = std::string("");
    break;
  default:
    ArrowErrorSet(error,
                  "Unexpected value for sign read from Postgres numeric field: %d",
                  static_cast<int>(sign));
    return EINVAL;
}

if (!special_value.empty()) {
  NANOARROW_RETURN_NOT_OK(
      ArrowBufferAppend(data_, special_value.data(), special_value.size()));
  NANOARROW_RETURN_NOT_OK(ArrowBufferAppendInt32(offsets_, data_->size_bytes));
  return AppendValid(array);
}

// Calculate string space requirement
int64_t max_chars_required = std::max<int64_t>(1, (weight + 1) * kDecDigits);
max_chars_required += dscale + kDecDigits + 2;
NANOARROW_RETURN_NOT_OK(ArrowBufferReserve(data_, max_chars_required));
char* out0 = reinterpret_cast<char*>(data_->data + data_->size_bytes);
char* out = out0;

// Build output string in-place, starting with the negative sign
if (sign == kNumericNeg) {
  *out++ = '-';
}

// ...then digits before the decimal point
int d;
int d1;
int16_t dig;

if (weight < 0) {
  d = weight + 1;
  *out++ = '0';
} else {
  for (d = 0; d <= weight; d++) {
    if (d < ndigits) {
      dig = digits_[d];
    } else {
      dig = 0;
    }

    // To strip leading zeroes
    int append = (d > 0);

    for (const auto pow10 : {1000, 100, 10, 1}) {
      d1 = dig / pow10;
      dig -= d1 * pow10;
      append |= (d1 > 0);
      if (append) {
        *out++ = d1 + '0';
      }
    }
  }
}

// ...then the decimal point + digits after it. This may write more digits
// than specified by dscale so we need to keep track of how many we want to
// keep here.
int64_t actual_chars_required = out - out0;

if (dscale > 0) {
  *out++ = '.';
  actual_chars_required += dscale + 1;

  for (int i = 0; i < dscale; i++, d++, i += kDecDigits) {
    if (d >= 0 && d < ndigits) {
      dig = digits_[d];
    } else {
      dig = 0;
    }

    for (const auto pow10 : {1000, 100, 10, 1}) {
      d1 = dig / pow10;
      dig -= d1 * pow10;
      *out++ = d1 + '0';
    }
  }
}

// Update data buffer size and add offsets
data_->size_bytes += actual_chars_required;
NANOARROW_RETURN_NOT_OK(ArrowBufferAppendInt32(offsets_, data_->size_bytes));
return AppendValid(array);

I am guessing there is an error in there. I can take a look in the next few days!

paleolimbot (Member) commented:

My guess is that the value of actual_chars_required is not always correct in our implementation!
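
If that is the case, the failure mode would look roughly like the following standalone sketch (a hypothetical illustration, not the driver code): the size bookkeeping advances past what was actually written, so reserved-but-unwritten buffer bytes end up inside the string value.

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Minimal sketch (not the driver code): write 6 characters into a reserved
// buffer, but advance the logical size by 7 as if all of dscale had been
// written. The extra byte was never initialized, so whatever happens to be
// in the buffer becomes the trailing "garbage" byte of the string value.
int main() {
  std::vector<char> buffer(16, '\xCC');  // stand-in for stale reserved bytes
  char* out0 = buffer.data();
  char* out = out0;
  for (char c : std::string("1.0123")) *out++ = c;  // 6 bytes actually written

  int64_t actual_chars_required = 7;  // bookkeeping claims one byte too many
  std::string value(out0, static_cast<std::size_t>(actual_chars_required));

  assert(value.size() == 7);
  assert(value.back() == '\xCC');  // the stray byte seen in the reproducer output
  return 0;
}

That would also be consistent with the trailing garbage changing between test runs, since those bytes are whatever happened to already be in the buffer.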

lupko (Contributor, Author) commented Jan 31, 2024

@paleolimbot I have some free time, so I'm poking around the code and have started comparing the funky code with the original from PostgreSQL (it's a bit late here so I'm on sort of autopilot :))

I think the missing leading zero is due to this: maint-0.9.0...lupko:arrow-adbc:maint-0.9.0#diff-c5dc283c26154e7a7e1291b73f5131cf4a180240729fea7f459980f60b08486cR438

It seems to be a clear deviation from the original code.

Trying locally with those changes, I see the leading zero is restored.
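
For comparison, here is a rough, self-contained paraphrase of how the upstream get_str_from_var (the DEC_DIGITS = 4 variant) emits the integer part of a base-10000 digit group; it is a sketch for illustration, not the upstream code or the ADBC fix verbatim. The difference from the driver code quoted above is that the ones digit of each group is written unconditionally, so a zero value still renders a '0' before the decimal point:

#include <cstdint>
#include <cstdio>
#include <initializer_list>
#include <string>

// Sketch modeled on PostgreSQL's get_str_from_var (DEC_DIGITS = 4); not the
// exact upstream or ADBC code. Leading-zero suppression applies only to the
// first three decimal digits of the first group -- the ones digit is always
// emitted, which is what keeps the lone '0' in front of the decimal point.
static void AppendIntegerGroup(std::string& out, int16_t dig, bool first_group) {
  bool putit = !first_group;
  for (int pow10 : {1000, 100, 10}) {
    int d1 = dig / pow10;
    dig -= static_cast<int16_t>(d1 * pow10);
    putit = putit || (d1 > 0);
    if (putit) out.push_back(static_cast<char>('0' + d1));
  }
  out.push_back(static_cast<char>('0' + dig));  // always emitted
}

int main() {
  std::string s;
  AppendIntegerGroup(s, 0, /*first_group=*/true);  // a NUMERIC zero: single group of 0
  std::printf("%s\n", s.c_str());                  // prints "0", not an empty string
}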

lupko (Contributor, Author) commented Jan 31, 2024

The trailing corruption is likely due to this: lupko@f171a17
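
For reference, the fractional part in the upstream get_str_from_var advances the group index by DEC_DIGITS once per iteration and then truncates back to exactly dscale digits. A rough, self-contained paraphrase (a sketch for illustration, not the upstream code or the ADBC fix verbatim):

#include <cstdint>
#include <initializer_list>
#include <string>
#include <vector>

// Sketch modeled on PostgreSQL's get_str_from_var (DEC_DIGITS = 4); not the
// exact upstream or ADBC code. Two things differ from the driver loop quoted
// earlier: (1) `i` advances by kDecDigits only, once per group, rather than
// also being incremented with i++, and (2) the output is truncated back to
// exactly dscale fractional digits afterwards.
static void AppendFraction(std::string& out, const std::vector<int16_t>& digits,
                           int d, int dscale) {
  constexpr int kDecDigits = 4;
  out.push_back('.');
  const size_t end = out.size() + static_cast<size_t>(dscale);
  for (int i = 0; i < dscale; d++, i += kDecDigits) {
    int16_t dig = (d >= 0 && d < static_cast<int>(digits.size())) ? digits[d] : 0;
    for (int pow10 : {1000, 100, 10, 1}) {
      int d1 = dig / pow10;
      dig -= static_cast<int16_t>(d1 * pow10);
      out.push_back(static_cast<char>('0' + d1));
    }
  }
  out.resize(end);  // drop any digits written past dscale
}

int main() {
  // e.g. the digit groups for 1.0123 with dscale = 5: integer group {1},
  // fractional group {123}; d starts just past the integer part.
  std::string s = "1";
  AppendFraction(s, {1, 123}, /*d=*/1, /*dscale=*/5);
  // s is now "1.01230": five fractional digits, no stray byte
  return 0;
}

With the driver loop quoted earlier, the same input emits only four fractional digits while actual_chars_required still accounts for dscale + 1 characters, which matches the "1.0123" plus one garbage byte in the sample output above.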

lupko (Contributor, Author) commented Jan 31, 2024

Here is a PR that makes the conversion faithful to the PostgreSQL counterpart: https://github.com/apache/arrow-adbc/pull/1502/files

Local testing shows the single zero before the decimal point is now present and the corruption is gone.

lidavidm pushed a commit that referenced this issue Feb 1, 2024
- The conversion from numeric to string had two bugs due to deviations from the original PostgreSQL code:
  - the single leading zero before the decimal point would always be dropped
  - in some cases, the digits after the decimal point would be incomplete, with the remainder replaced by 'garbage'

here is the PostgreSQL code:
https://github.com/postgres/postgres/blob/9589b038d3203cd5ba708fb4f5c23182c88ad0b3/src/backend/utils/adt/numeric.c#L7443

(the DEC_DIGITS=4 variant)

Fixes #1501.