Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replacing \u0000 with nothing for PostgreSqlDatabaseDialect #1264

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

oleksandrkovalenko
Copy link

Problem

\u0000 UFT character is not supported by Postgres text field. An attempt to insert fails with error:

org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0x00`

Solution

We are replacing \u0000 character with nothing

Does this solution apply anywhere else?
  • yes
  • no
If yes, where?

Test Strategy

Testing done:
  • Unit tests
  • Integration tests
  • System tests
  • Manual tests

Release Plan

@oleksandrkovalenko oleksandrkovalenko requested a review from a team as a code owner November 4, 2022 13:03
@CLAassistant
Copy link

CLAassistant commented Nov 4, 2022

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@OneCricketeer
Copy link

How are your producers adding that byte sequence in the first place? Wouldn't it make more sense to fix your strings at the producer?

@oleksandr-blacklane
Copy link

@OneCricketeer In our case, our producer is a Kafka Connector which reads data from MySQL database. Null character \u0000 is supported by MySQL. In any case, we need to change one of the connectors to make this thing work.

@FreCap
Copy link

FreCap commented Mar 26, 2023

How are your producers adding that byte sequence in the first place? Wouldn't it make more sense to fix your strings at the producer?

Hi @OneCricketeer

According to: https://stackoverflow.com/a/1348551, 0x00 is never a supported character and MySQL produces it because it is a supported char.

PostgreSQL doesn't support storing NULL (\0x00) characters in text fields (this is obviously different from the database NULL value, which is fully supported).

Source: http://www.postgresql.org/docs/9.1/static/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS-UESCAPE

Often time we don't control the source, so it needs to be done at the consumer.
At the same way, we don't even control the tables (100s of tables), and creating a list of all of the text "fields" would be hardly doable.

@oleksandrkovalenko, maybe we could put this behind a flag so we can ensure no behavior change?
@OneCricketeer would you agree with the PR if the change was behind a flag?

@roadSurfer
Copy link

If replacing with nothing the correct thing to do? We've elected to replace with the "Replace Character" (\ufffd) via an SMT for the moment.
Or is there a good reason to not do this?

@christianfeurer
Copy link

Stumbled across the same issue. As @FreCap mentioned NULL (\0x00) is not a valid character for postgres.
If I'm trying to insert this into as text column it's failing today:
org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0x00

As the suggested change will replace this \0x00 with an empty string is technically not the same. The suggested PostgreSQL solution would be to change the column definition from text to bytea to store hex values. For my personal usecase this is not an option. I'd benefit from opt-in replacement from \0x00 to empty string.

As the change is limited to the PostgreSqlDatabaseDialect the expected behavior change would be that if the opt-int option is taken then there is no more error when processing a message containing a \0x00 value for a text field. This would keep the connector up and running. That would be great :)

@oleksandrkovalenko
Copy link
Author

@FreCap Would you be able to share an example how can I put this change behind a flag?
Is it make sense to push this change forward. In our case we fixed those characters in the MySQL database. I believe that could not be possible for all use cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants