Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write 4-byte characters (surrogate pairs) instead of escapes #1335

Merged
merged 7 commits into from
Sep 18, 2024

Conversation

rnetuka
Copy link
Contributor

@rnetuka rnetuka commented Sep 12, 2024

Issue: #223

A fix for issue #223. Some languages (Japanese, Chinese, ...) use 4-byte characters and it's annoying to have them marshaled as escapes.

I've tried to fix this with minimum intrusions. I've only added this feature/fix to _writeStringSegment2 method(s) and left the rest without it. If you like the changes for other methods calling _outputMultiByteChar, just let me know.

@pjfanning
Copy link
Member

  • this probably imposes a performance hit on all users
  • some users may even prefer the existing behaviour
  • I think this should be only applied if a new JsonGenerator.Feature is added and defaulted to false.

https://www.javadoc.io/static/com.fasterxml.jackson.core/jackson-core/2.17.2/com/fasterxml/jackson/core/JsonGenerator.Feature.html

@pjfanning
Copy link
Member

The change that you have made to PackageVersion.java.in breaks the build - please revert it.

@rnetuka
Copy link
Contributor Author

rnetuka commented Sep 12, 2024

@pjfanning Thanks for your review. I've added the feature. Not sure if StreamWriteFeature is the right place...

@pjfanning
Copy link
Member

@pjfanning Thanks for your review. I've added the feature. Not sure if StreamWriteFeature is the right place...

Seems ok to me. Do you know if other Generator implementations handle surrogates ok? Eg WriterBasedJsonGenerator.

@pjfanning
Copy link
Member

@rnetuka @cowtowncoder It looks like the Windows CI build cannot handle UTF-8 chars in source files.

@rnetuka
Copy link
Contributor Author

rnetuka commented Sep 12, 2024

@pjfanning WriterBasedJsonGenerator works fine and writes the surrogate pairs automatically.

As for UTF-8 chars in source files, do you mean the ones in the new test? If so, I can just replace them with escapes.

@pjfanning
Copy link
Member

@pjfanning WriterBasedJsonGenerator works fine and writes the surrogate pairs automatically.

As for UTF-8 chars in source files, do you mean the ones in the new test? If so, I can just replace them with escapes.

The Windows build doesn't like any of your non-ASCII chars. Could you have escape all the non-ASCII chars?

@cowtowncoder
Copy link
Member

Overall looks good; as long as overhead is localized into handling of 4-byte cases it should be fine (performance would be my main concern in general).

@cowtowncoder
Copy link
Member

@rnetuka Forgot to say: big thank you for contributing this!!! I'll try to make sure this gets in 2.18.0 release; as I said this looks pretty good already.

@rnetuka
Copy link
Contributor Author

rnetuka commented Sep 13, 2024

@cowtowncoder Based on you review, I've updated the PR with following changes:

  • moved the feature into the group // Misc other
  • added javadoc for the feature
  • reused Surrogate223Test (which is now green) instead of creating new test method

I've renamed the new feature to COMBINE_UNICODE_SURROGATES which I believe will better describe its purpose. I've also put the feature only to JsonGenerator (previously I put it also in StreamWriteFeature, but that would make it kind of duplicate).

Thank you for you review. If you have any more comments, just let me know.

@cowtowncoder cowtowncoder added the cla-needed PR looks good (although may also require code review), but CLA needed from submitter label Sep 16, 2024
@cowtowncoder
Copy link
Member

@rnetuka Just realized I don't have a CLA from you. However, I do have corporate-CLA from Red Hat -- and it looks like you are employed by Red Hat. If this is the case, I think you are covered and we are good.
Please let me know if this is the case (will remove "cla-needed" label).

Also: I see failed unit tests -- is this PR still work-in-progress?
(I had to approve CI run since this is your first contribution, hopefully it will now re-run on all pushes)

@rnetuka
Copy link
Contributor Author

rnetuka commented Sep 16, 2024

@cowtowncoder Yes, I'm Red Hat employee.

As for the test, I've noticed Surrogate223Test#surrogatesCharBacked is written for an option to escape the surrogate pairs for StringWriter/WriterBasedJsonGenerator. WriterBasedJsonGenerator always combines the surrogate pairs together, so I've removed the latter part of the test. Please, take a look and confirm this is ok for you.

To sum it up, if merged at this point:
UTF8JsonGenerator escapes 4-byte characters by default
UTF8JsonGenerator can be configured to combine the bytes instead
WriterBasedJsonGenerator always combines the bytes together

Unless you request other changes, I consider this PR done.

@cowtowncoder cowtowncoder removed the cla-needed PR looks good (although may also require code review), but CLA needed from submitter label Sep 16, 2024
private int _outputSurrogatePair(char highSurrogate, char lowSurrogate, int outputPtr) {
String s = String.valueOf(highSurrogate) + lowSurrogate;
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
System.arraycopy(bytes, 0, _outputBuffer, outputPtr, bytes.length);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yikes. This is rather wasteful way of doing this... Let me think we really shouldn't have do too this much allocation of things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would

private int _outputSurrogatePair(char highSurrogate, char lowSurrogate, int outputPtr) {
        int unicode = 0x10000;
        unicode += (highSurrogate & 0x03FF) << 10;
        unicode += (lowSurrogate & 0x03FF);

        byte[] utf8 = new byte[4];
        utf8[0] = (byte) (0b11110000 + ((unicode & 0b00000000_00011100_00000000_00000000) >> 18));
        utf8[1] = (byte) (0b10000000 + ((unicode & 0b00000000_00000011_11110000_00000000) >> 12));
        utf8[2] = (byte) (0b10000000 + ((unicode & 0b00000000_00000000_00001111_11000000) >> 6));
        utf8[3] = (byte) (0b10000000 + (unicode & 0b00000000_00000000_00000000_00111111));

        System.arraycopy(utf8, 0, _outputBuffer, outputPtr, utf8.length);
        return outputPtr + utf8.length;
}

be better?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. :)

Can also eliminate temporary buffer, assign directly in _outputBuffer.

Thank you for working with me here!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copied with minor modifications; also realized unit test was not verifying actual round-trip encode/decode, added. Now catches some of problems I tried.
(ideally we'd have more tests for surrogate pairs but will have to do for now).

Will merge.

@cowtowncoder
Copy link
Member

Thank you for contributing this, @rnetuka ! Will merge in 2.18 for inclusion in 2.18.0 release.

@cowtowncoder cowtowncoder merged commit 4d47aae into FasterXML:2.18 Sep 18, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants