Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Fix UnicodeUtil#truncateStringMax returns malformed string. #11161

Merged
merged 4 commits into from
Oct 7, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 3 additions & 3 deletions api/src/main/java/org/apache/iceberg/types/Comparators.java
Original file line number Diff line number Diff line change
Expand Up @@ -321,9 +321,9 @@ private CharSeqComparator() {}
* represented using two Java characters (using UTF-16 surrogate pairs). Character by character
* comparison may yield incorrect results while comparing a 4 byte UTF-8 character to a java
* char. Character by character comparison works as expected if both characters are <= 3 byte
* UTF-8 character or both characters are 4 byte UTF-8 characters.
* isCharInUTF16HighSurrogateRange method detects a 4-byte character and considers that
* character to be lexicographically greater than any 3 byte or lower UTF-8 character.
* UTF-8 character or both characters are 4 byte UTF-8 characters. isCharHighSurrogate method
* detects a high surrogate (4-byte character) and considers that character to be
* lexicographically greater than any 3 byte or lower UTF-8 character.
*/
@Override
public int compare(CharSequence s1, CharSequence s2) {
Expand Down
24 changes: 22 additions & 2 deletions api/src/main/java/org/apache/iceberg/util/UnicodeUtil.java
Original file line number Diff line number Diff line change
Expand Up @@ -82,9 +82,9 @@ public static Literal<CharSequence> truncateStringMax(Literal<CharSequence> inpu
for (int i = length - 1; i >= 0; i--) {
// Get the offset in the truncated string buffer where the number of unicode characters = i
int offsetByCodePoint = truncatedStringBuilder.offsetByCodePoints(0, i);
int nextCodePoint = truncatedStringBuilder.codePointAt(offsetByCodePoint) + 1;
int nextCodePoint = incrementCodePoint(truncatedStringBuilder.codePointAt(offsetByCodePoint));
// No overflow
if (nextCodePoint != 0 && Character.isValidCodePoint(nextCodePoint)) {
if (nextCodePoint != 0) {
truncatedStringBuilder.setLength(offsetByCodePoint);
// Append next code point to the truncated substring
truncatedStringBuilder.appendCodePoint(nextCodePoint);
Expand All @@ -93,4 +93,24 @@ public static Literal<CharSequence> truncateStringMax(Literal<CharSequence> inpu
}
return null; // Cannot find a valid upper bound
}

private static int incrementCodePoint(int codePoint) {
// surrogate code points are not Unicode scalar values,
// any UTF-8 byte sequence that would otherwise map to code points U+D800..U+DFFF is ill-formed.
// see https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G27288
Preconditions.checkArgument(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a minor point here, but shouldn't this only be relevant if we somehow get non-unicode binary in a unicode string? Shouldn't be possible in a Java string right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible for Java strings to contain only one unpaired surrogate character(non-unicode character), though encoding them using UTF-8 or UTF-16 will result in MalformedInputException. This is also the case in this issue, where the truncation method returns a string ending with an unpaired high-surrogate character, but fails when encoding it to UTF-8.

For a valid UTF-8 string, it will not contain unpaired surrogates. However, the codePointAt method may return a unpaired surrogate code point if an incorrect index is passed.

/**
* Returns the character (Unicode code point) at the specified
* index. The index refers to {@code char} values
* (Unicode code units) and ranges from {@code 0} to
* {@link #length()}{@code - 1}.
*
*

If the {@code char} value specified at the given index
* is in the high-surrogate range, the following index is less
* than the length of this sequence, and the
* {@code char} value at the following index is in the
* low-surrogate range, then the supplementary code point
* corresponding to this surrogate pair is returned. Otherwise,
* the {@code char} value at the given index is returned.
*
* @param index the index to the {@code char} values
* @return the code point value of the character at the
* {@code index}
* @throws IndexOutOfBoundsException if the {@code index}
* argument is negative or not less than the length of this
* sequence.
*/
public int codePointAt(int index) {

Currently, all methods in the UnicodeUtil class that use codePointAt are correct and will not result in an unpaired surrogate code point. I added it to strengthen the validation.

codePoint < Character.MIN_SURROGATE || codePoint > Character.MAX_SURROGATE,
"invalid code point: %s",
codePoint);

if (codePoint == Character.MIN_SURROGATE - 1) {
// increment to the next Unicode scalar value
return Character.MAX_SURROGATE + 1;
} else if (codePoint == Character.MAX_CODE_POINT) {
// overflow
return 0;
} else {
return codePoint + 1;
}
}
}
34 changes: 31 additions & 3 deletions core/src/test/java/org/apache/iceberg/TestMetricsTruncation.java
Original file line number Diff line number Diff line change
Expand Up @@ -202,11 +202,20 @@ public void testTruncateStringMax() {
String test5 = "\uDBFF\uDFFF\uDBFF\uDFFF";
String test6 = "\uD800\uDFFF\uD800\uDFFF";
// Increment the previous character
String test6_2_expected = "\uD801\uDC00";
String test6_1_expected = "\uD801\uDC00";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be a typo, "\uD800\uDFFF" is a Unicode surrogate pair, it's length is 1.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused on this one, why is this one changed? Why did the test pass before?

String test7 = "\uD83D\uDE02\uD83D\uDE02\uD83D\uDE02";
String test7_2_expected = "\uD83D\uDE02\uD83D\uDE03";
String test7_1_expected = "\uD83D\uDE03";

// Increment the max UTF-8 character will overflow
String test8 = "a\uDBFF\uDFFFc";
String test8_2_expected = "b";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The characters in test5 are all MAX_CODE_POINT, so the upper bound does not exist.

test8 adds a case where an overflow occurs due to MAX_CODE_POINT, but it is possible to increment the previous character to get an upper bound.


// Increment skip invalid Unicode scalar values [Character.MIN_SURROGATE,
// Character.MAX_SURROGATE]
String test9 = "a" + (char) (Character.MIN_SURROGATE - 1) + "b";
String test9_2_expected = "a" + (char) (Character.MAX_SURROGATE + 1);

Comparator<CharSequence> cmp = Literal.of(test1).comparator();
assertThat(cmp.compare(truncateStringMax(Literal.of(test1), 4).value(), test1))
.as("Truncated upper bound should be greater than or equal to the actual upper bound")
Expand Down Expand Up @@ -254,10 +263,10 @@ public void testTruncateStringMax() {
assertThat(truncateStringMax(Literal.of(test5), 1))
.as("An upper bound doesn't exist since the first two characters are max UTF-8 characters")
.isNull();
assertThat(cmp.compare(truncateStringMax(Literal.of(test6), 2).value(), test6))
assertThat(cmp.compare(truncateStringMax(Literal.of(test6), 1).value(), test6))
.as("Truncated upper bound should be greater than or equal to the actual upper bound")
.isGreaterThanOrEqualTo(0);
assertThat(cmp.compare(truncateStringMax(Literal.of(test6), 1).value(), test6_2_expected))
assertThat(cmp.compare(truncateStringMax(Literal.of(test6), 1).value(), test6_1_expected))
.as(
"Test 4 byte UTF-8 character increment. Output must have one character with "
+ "the first character incremented")
Expand All @@ -273,5 +282,24 @@ public void testTruncateStringMax() {
.as(
"Test input with multiple 4 byte UTF-8 character where the first unicode character should be incremented")
.isEqualTo(0);

assertThat(cmp.compare(truncateStringMax(Literal.of(test8), 2).value(), test8))
.as("Truncated upper bound should be greater than or equal to the actual upper bound")
.isGreaterThanOrEqualTo(0);
assertThat(cmp.compare(truncateStringMax(Literal.of(test8), 2).value(), test8_2_expected))
.as(
"Test the last character is the 4-byte max UTF-8 character after truncated where the second-to-last "
+ "character should be incremented")
.isEqualTo(0);

assertThat(cmp.compare(truncateStringMax(Literal.of(test9), 2).value(), test9))
.as("Truncated upper bound should be greater than or equal to the actual upper bound")
.isGreaterThanOrEqualTo(0);

assertThat(cmp.compare(truncateStringMax(Literal.of(test9), 2).value(), test9_2_expected))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is missing an actuall assertion check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

.as(
"Test the last character is `Character.MIN_SURROGATE - 1` after truncated, it should be incremented to "
+ "next valid Unicode scalar value `Character.MAX_SURROGATE + 1`")
.isEqualTo(0);
}
}