Core: Fix UnicodeUtil#truncateStringMax returns malformed string. #11161
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -202,11 +202,20 @@ public void testTruncateStringMax() { | |
String test5 = "\uDBFF\uDFFF\uDBFF\uDFFF"; | ||
String test6 = "\uD800\uDFFF\uD800\uDFFF"; | ||
// Increment the previous character | ||
String test6_2_expected = "\uD801\uDC00"; | ||
String test6_1_expected = "\uD801\uDC00"; | ||
I think this should be a typo: "\uD800\uDFFF" is a Unicode surrogate pair, so its length is 1 (one code point).
I'm confused by this one. Why was it changed? Why did the test pass before?
||
String test7 = "\uD83D\uDE02\uD83D\uDE02\uD83D\uDE02"; | ||
String test7_2_expected = "\uD83D\uDE02\uD83D\uDE03"; | ||
String test7_1_expected = "\uD83D\uDE03"; | ||
|
||
// Increment the max UTF-8 character will overflow | ||
String test8 = "a\uDBFF\uDFFFc"; | ||
String test8_2_expected = "b"; | ||
The characters in
|
||
|
||
// Increment skip invalid Unicode scalar values [Character.MIN_SURROGATE, | ||
// Character.MAX_SURROGATE] | ||
String test9 = "a" + (char) (Character.MIN_SURROGATE - 1) + "b"; | ||
String test9_2_expected = "a" + (char) (Character.MAX_SURROGATE + 1); | ||
|
||
Comparator<CharSequence> cmp = Literal.of(test1).comparator(); | ||
assertThat(cmp.compare(truncateStringMax(Literal.of(test1), 4).value(), test1)) | ||
.as("Truncated upper bound should be greater than or equal to the actual upper bound") | ||
|
@@ -254,10 +263,10 @@ public void testTruncateStringMax() { | |
assertThat(truncateStringMax(Literal.of(test5), 1)) | ||
.as("An upper bound doesn't exist since the first two characters are max UTF-8 characters") | ||
.isNull(); | ||
assertThat(cmp.compare(truncateStringMax(Literal.of(test6), 2).value(), test6)) | ||
assertThat(cmp.compare(truncateStringMax(Literal.of(test6), 1).value(), test6)) | ||
.as("Truncated upper bound should be greater than or equal to the actual upper bound") | ||
.isGreaterThanOrEqualTo(0); | ||
assertThat(cmp.compare(truncateStringMax(Literal.of(test6), 1).value(), test6_2_expected)) | ||
assertThat(cmp.compare(truncateStringMax(Literal.of(test6), 1).value(), test6_1_expected)) | ||
.as( | ||
"Test 4 byte UTF-8 character increment. Output must have one character with " | ||
+ "the first character incremented") | ||
|
@@ -273,5 +282,24 @@ public void testTruncateStringMax() { | |
.as( | ||
"Test input with multiple 4 byte UTF-8 character where the first unicode character should be incremented") | ||
.isEqualTo(0); | ||
|
||
assertThat(cmp.compare(truncateStringMax(Literal.of(test8), 2).value(), test8)) | ||
.as("Truncated upper bound should be greater than or equal to the actual upper bound") | ||
.isGreaterThanOrEqualTo(0); | ||
assertThat(cmp.compare(truncateStringMax(Literal.of(test8), 2).value(), test8_2_expected)) | ||
.as( | ||
"Test the last character is the 4-byte max UTF-8 character after truncated where the second-to-last " | ||
+ "character should be incremented") | ||
.isEqualTo(0); | ||
|
||
assertThat(cmp.compare(truncateStringMax(Literal.of(test9), 2).value(), test9)) | ||
.as("Truncated upper bound should be greater than or equal to the actual upper bound") | ||
.isGreaterThanOrEqualTo(0); | ||
|
||
assertThat(cmp.compare(truncateStringMax(Literal.of(test9), 2).value(), test9_2_expected)) | ||
This is missing an actual assertion check.
Fixed.
||
.as( | ||
"Test the last character is `Character.MIN_SURROGATE - 1` after truncated, it should be incremented to " | ||
+ "next valid Unicode scalar value `Character.MAX_SURROGATE + 1`") | ||
.isEqualTo(0); | ||
} | ||
} |
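For readers following the test cases above, here is a minimal sketch of the behavior they describe: truncate to a number of code points, then increment the last code point that can be incremented, skipping the surrogate range and dropping trailing code points that would overflow. `TruncateMaxSketch` and `truncateMax` are hypothetical names chosen for illustration; this is not the actual `UnicodeUtil#truncateStringMax` implementation, which operates on `Literal` values.

```java
// Hypothetical sketch, NOT Iceberg's UnicodeUtil: illustrates the rules the tests exercise.
final class TruncateMaxSketch {

  // Returns an upper bound for `input` that is at most `length` code points long,
  // or null when no such bound exists.
  static String truncateMax(String input, int length) {
    if (input.codePointCount(0, input.length()) <= length) {
      return input; // already short enough, the input is its own upper bound
    }

    // Cut on a code point boundary so a surrogate pair is never split.
    int end = input.offsetByCodePoints(0, length);
    StringBuilder sb = new StringBuilder(input.substring(0, end));

    // Walk backwards until a code point can be incremented.
    while (sb.length() > 0) {
      int cp = sb.codePointBefore(sb.length());
      sb.setLength(sb.length() - Character.charCount(cp));

      int next = cp + 1;
      // Surrogates are not Unicode scalar values, so jump past the whole range.
      if (next >= Character.MIN_SURROGATE && next <= Character.MAX_SURROGATE) {
        next = Character.MAX_SURROGATE + 1;
      }
      if (next <= Character.MAX_CODE_POINT) {
        return sb.appendCodePoint(next).toString();
      }
      // cp was U+10FFFF: drop it and try the previous code point.
    }
    return null; // every kept code point was the maximum
  }
}
```

Under these assumptions, the `test8` input truncated to 2 code points drops the trailing maximum code point and increments `a` to `b`, and the `test9` input steps from `Character.MIN_SURROGATE - 1` directly to `Character.MAX_SURROGATE + 1`.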
Just a minor point here, but shouldn't this only be relevant if we somehow get non-Unicode binary data in a Unicode string? That shouldn't be possible in a Java string, right?
A Java string can contain an unpaired surrogate character (which is not a valid Unicode scalar value), though encoding such a string as UTF-8 or UTF-16 results in a `MalformedInputException`. That is what happens in this issue: the truncation method returns a string ending with an unpaired high surrogate, which then fails when it is encoded to UTF-8.

A valid UTF-8 string never contains unpaired surrogates. However, `codePointAt` can return an unpaired surrogate code point if it is passed an incorrect index. Currently, all methods in the `UnicodeUtil` class that use `codePointAt` are correct and will not produce an unpaired surrogate code point. I added the check to strengthen the validation.
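To make the point above concrete, here is a small sketch using only JDK classes. It assumes a strict encoder obtained from `StandardCharsets.UTF_8.newEncoder()`, which reports malformed input by default; `UnpairedSurrogateSketch` and its method names are hypothetical and not part of `UnicodeUtil`.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: shows how an unpaired surrogate can appear and why encoding fails.
final class UnpairedSurrogateSketch {

  // True if a strict UTF-8 encoder accepts the string.
  static boolean encodesAsUtf8(String s) {
    try {
      StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
      return true;
    } catch (CharacterCodingException e) {
      // MalformedInputException is a subclass of CharacterCodingException.
      return false;
    }
  }

  // True if any char in the string is a surrogate that is not part of a pair.
  static boolean hasUnpairedSurrogate(String s) {
    for (int i = 0; i < s.length(); ) {
      int cp = s.codePointAt(i);
      if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
        return true; // codePointAt returned a lone surrogate
      }
      i += Character.charCount(cp);
    }
    return false;
  }
}
```

For example, cutting a string by `char` index in the middle of a surrogate pair (as in `"\uD83D\uDE02".substring(0, 1)`) leaves a lone high surrogate that `hasUnpairedSurrogate` flags and `encodesAsUtf8` rejects. Likewise, calling `codePointAt(1)` on that same two-char string returns the low surrogate on its own, because index 1 points into the middle of the pair.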