Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider reporting errors during encoding in ReaderInputStream #6944

Open
3 tasks done
bjmi opened this issue Jan 26, 2024 · 6 comments · May be fixed by #7218
Open
3 tasks done

Consider reporting errors during encoding in ReaderInputStream #6944

bjmi opened this issue Jan 26, 2024 · 6 comments · May be fixed by #7218
Labels

Comments

@bjmi
Copy link

bjmi commented Jan 26, 2024

API(s)

ReaderInputStream(reader, charset, bufferSize)

How do you want it to be improved?

Use the default error action for malformed input and unmappable characters when creating the encoder by removing following lines:

.onMalformedInput(CodingErrorAction.REPLACE)
.onUnmappableCharacter(CodingErrorAction.REPLACE)

and update the corr. Javadoc.

Why do we need it to be improved?

Data is silently changed and conversion errors are just ignored.
This lead to corrupt data, it is particularly bad if it goes undetected for a long period of time.
Therefore the default behavior should make the user aware of the problem.
Newer introduced API (JDK 11) like java.nio.file.Files.readString(..) and java.nio.file.Files.writeString(..) also use the reporting error approach.

An alternative would be to make ReaderInputStream public and allow to pass a Reader and a CharsetEncoder.
The encoder can be created by the caller via charset.newEncoder() easily and configured according the intended use case.
This was requested in #5376.

Example

No code needs to be changed beforehand. It's just about reporting a problem.

Current Behavior

Encoding errors are silently ignored and lead to corrupt text files.

Desired Behavior

Encoding errors should raise an exception and make errors visible, subsequently the code or data in question gets fixed.

Concrete Use Cases

We process texts from customers that use German and Cyrillic letters and it is crucial that the content remains intact when decoding / encoding is used.

Checklist

@bjmi bjmi added the type=enhancement Make an existing feature better label Jan 26, 2024
@bjmi
Copy link
Author

bjmi commented Jan 29, 2024

Further ideas

  • Introduce com.google.common.io.CharStreams.asInputStream(Reader, Charset): InputStream and com.google.common.io.CharStreams.asInputStream(Reader, CharsetEncoder): InputStream
  • Overload com.google.common.io.CharSource.asByteSource(Supplier<CharsetEncoder>)

Passing a CharsetEncoder allows you to define the behavior on malformed input and on unmappable characters. The existing Charset is actually an encoder factory.

@chaoren
Copy link
Member

chaoren commented Feb 13, 2024

The relevant public API is CharSource.asByteSource(Charset).openStream(), right? I don't see ReaderInputStream being used from anywhere else.

@bjmi
Copy link
Author

bjmi commented Feb 14, 2024

You are right :)

@tobbi
Copy link

tobbi commented May 14, 2024

I gave this a shot in #7218. I'd appreciate any input you might have. I only implemented the suggestions in #6944 (comment) . Please don't yell at me if I did something wrong because I only based my changes on the original comment and am quite new to this codebase. ;)

@tobbi
Copy link

tobbi commented May 14, 2024

Should I try tackling the stuff mentioned in #6944 (comment) as well?

@bjmi
Copy link
Author

bjmi commented May 15, 2024

Should I try tackling the stuff mentioned in #6944 (comment) as well?

Thank you for your effort.
It would suffice to provide com.google.common.io.CharStreams.asInputStream(Reader, Charset): InputStream without changing CodingErrorActions of the encoder.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants