Core: Fix UnicodeUtil#truncateStringMax returns malformed string. #11161

zhongyujiang · 2024-09-18T14:32:39Z

We encountered an exception while writing data, and the stack trace is as follows. It occurred during the collection of Parquet column metrics:

Exception stack:

Suppressed: org.apache.iceberg.exceptions.RuntimeIOException: Failed to encode value as UTF-8: Ҋ�Qڞ<֔~�M�ECڮV?
at org.apache.iceberg.types.Conversions.toByteBuffer(Conversions.java:110)
at org.apache.iceberg.types.Conversions.toByteBuffer(Conversions.java:83)
at org.apache.iceberg.parquet.ParquetUtil.toBufferMap(ParquetUtil.java:343)
at org.apache.iceberg.parquet.ParquetUtil.footerMetrics(ParquetUtil.java:174)
at org.apache.iceberg.parquet.ParquetUtil.footerMetrics(ParquetUtil.java:86)
at org.apache.iceberg.parquet.ParquetWriter.metrics(ParquetWriter.java:166)
at org.apache.iceberg.io.DataWriter.close(DataWriter.java:100)
at org.apache.iceberg.io.RollingFileWriter.closeCurrentWriter(RollingFileWriter.java:122)
at org.apache.iceberg.io.RollingFileWriter.close(RollingFileWriter.java:147)
at org.apache.iceberg.io.RollingDataWriter.close(RollingDataWriter.java:32)
at org.apache.iceberg.io.FanoutWriter.closeWriters(FanoutWriter.java:82)
at org.apache.iceberg.io.FanoutWriter.close(FanoutWriter.java:74)
at org.apache.iceberg.io.FanoutDataWriter.close(FanoutDataWriter.java:31)
at org.apache.iceberg.spark.source.SparkWrite$PartitionedDataWriter.close(SparkWrite.java:1162)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$9(WriteToDataSourceV2Exec.scala:423)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1496)
... 10 more
Caused by: java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
at java.nio.charset.CharsetEncoder.encode(CharsetEncoder.java:816)
at org.apache.iceberg.types.Conversions.toByteBuffer(Conversions.java:108)
... 25 more
Caused by: java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
at java.nio.charset.CharsetEncoder.encode(CharsetEncoder.java:816)
at org.apache.iceberg.types.Conversions.toByteBuffer(Conversions.java:108)

Investigation

After some investigation, I found that when Parquet column metrics are collected, string metrics are truncated to 16 characters by default. When the max metric is truncated and the truncated value is shorter than the original max value, the last character is incremented by 1 so that the truncated value is still greater than the original max value. However, this increment does not skip code points that cannot be encoded as UTF-8 (the surrogate range), which can lead to the exception above.

In the scenario where we encountered this issue, a Parquet file has a column whose max value is longer than 16 characters, and the code point of its 16th character is '\uD7FF', which is Character.MIN_SURROGATE - 1. Adding 1 to it yields Character.MIN_SURROGATE, which is not a valid Unicode scalar value. Therefore, when Conversions.toByteBuffer attempted to encode the truncated value as UTF-8, a MalformedInputException was thrown.
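The failure described above can be reproduced outside Iceberg with a strict UTF-8 encoder. This is a minimal sketch: the class and the helper `encodesAsUtf8` are hypothetical names used only for illustration, and only JDK APIs are called.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

class SurrogateEncodeDemo {
  // Hypothetical helper: strictly encodes a string as UTF-8 and reports
  // whether the encoder accepted it. A new encoder reports malformed input
  // by throwing instead of substituting a replacement byte.
  static boolean encodesAsUtf8(String value) {
    try {
      StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(value));
      return true;
    } catch (CharacterCodingException e) {
      // An unpaired surrogate surfaces here as a MalformedInputException.
      return false;
    }
  }

  public static void main(String[] args) {
    int last = 0xD7FF;           // the 16th character from the report
    int incremented = last + 1;  // 0xD800, i.e. Character.MIN_SURROGATE

    String before = "abc" + new String(Character.toChars(last));
    String after = "abc" + new String(Character.toChars(incremented));

    System.out.println(incremented == Character.MIN_SURROGATE); // true
    System.out.println(encodesAsUtf8(before));                  // true
    System.out.println(encodesAsUtf8(after));                   // false
  }
}
```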

This fix specifically skips illegal code points when incrementing the last character to avoid this issue.
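One way to picture the skip is a helper that returns the next Unicode scalar value after a given code point. This is only a sketch under assumptions: `nextScalarValue` and its `-1` return convention are hypothetical and are not taken from the actual change in UnicodeUtil.

```java
class IncrementCodePoint {
  // Hypothetical helper: returns the smallest Unicode scalar value greater
  // than codePoint, or -1 if codePoint is already Character.MAX_CODE_POINT.
  static int nextScalarValue(int codePoint) {
    if (codePoint >= Character.MAX_CODE_POINT) {
      return -1; // nothing larger exists
    }
    int next = codePoint + 1;
    // Surrogates (U+D800..U+DFFF) are not scalar values, so jump past them.
    if (next >= Character.MIN_SURROGATE && next <= Character.MAX_SURROGATE) {
      next = Character.MAX_SURROGATE + 1; // U+E000
    }
    return next;
  }

  public static void main(String[] args) {
    System.out.println(Integer.toHexString(nextScalarValue(0xD7FF))); // e000
    System.out.println(Integer.toHexString(nextScalarValue('a')));    // 62
    System.out.println(nextScalarValue(Character.MAX_CODE_POINT));    // -1
  }
}
```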

To reproduce

CREATE TABLE my_table (data string) using iceberg;
INSERT INTO my_table VALUES('abcdefghigklmno\uD7FFp');

zhongyujiang · 2024-09-19T03:21:47Z

@amogh-jahagirdar @nastra can you please help review this? thanks.

RussellSpitzer · 2024-09-19T21:15:21Z

api/src/main/java/org/apache/iceberg/util/UnicodeUtil.java

+    // surrogate code points are not Unicode scalar values,
+    // any UTF-8 byte sequence that would otherwise map to code points U+D800..U+DFFF is ill-formed.
+    // see https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G27288
+    Preconditions.checkArgument(


Just a minor point here, but shouldn't this only be relevant if we somehow get non-unicode binary in a unicode string? Shouldn't be possible in a Java string right?

It is possible for a Java string to contain an unpaired surrogate character (which is not a Unicode scalar value), though encoding such a string using UTF-8 or UTF-16 will result in a MalformedInputException. This is also the case in this issue: the truncation method returns a string ending with an unpaired high-surrogate character, and encoding it as UTF-8 then fails.

A valid UTF-8 string will not contain unpaired surrogates. However, the codePointAt method may return an unpaired surrogate code point if an incorrect index is passed.

/**
* Returns the character (Unicode code point) at the specified
* index. The index refers to {@code char} values
* (Unicode code units) and ranges from {@code 0} to
* {@link #length()}{@code - 1}.
*
* <p>If the {@code char} value specified at the given index
* is in the high-surrogate range, the following index is less
* than the length of this sequence, and the
* {@code char} value at the following index is in the
* low-surrogate range, then the supplementary code point
* corresponding to this surrogate pair is returned. Otherwise,
* the {@code char} value at the given index is returned.
*
* @param index the index to the {@code char} values
* @return the code point value of the character at the
* {@code index}
* @throws IndexOutOfBoundsException if the {@code index}
* argument is negative or not less than the length of this
* sequence.
*/
public int codePointAt(int index) {

Currently, all methods in the UnicodeUtil class that use codePointAt pass correct indexes and will not produce an unpaired surrogate code point. I added the precondition check to strengthen the validation.
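The codePointAt behavior quoted above can be seen with a string that contains one supplementary character. A small sketch follows; the class and method names are hypothetical, and the indexes are chosen by hand for this one input.

```java
class CodePointAtDemo {
  // 'a' followed by U+1F602, which is stored as the surrogate pair D83D DE02.
  static final String SAMPLE = "a\uD83D\uDE02";

  // Index 1 points at the high surrogate, so the whole pair is decoded.
  static int atPairStart() {
    return SAMPLE.codePointAt(1);
  }

  // Index 2 points into the middle of the pair, so the lone low surrogate
  // char value is returned as if it were a code point.
  static int atPairMiddle() {
    return SAMPLE.codePointAt(2);
  }

  public static void main(String[] args) {
    System.out.println(Integer.toHexString(atPairStart()));  // 1f602
    System.out.println(Integer.toHexString(atPairMiddle())); // de02
    // Stepping by code points instead of chars avoids landing mid-pair:
    // one code point after index 1 is index 3, the end of the string.
    System.out.println(SAMPLE.offsetByCodePoints(1, 1));     // 3
  }
}
```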

RussellSpitzer · 2024-09-19T21:19:21Z

core/src/test/java/org/apache/iceberg/TestMetricsTruncation.java

@@ -274,4 +274,17 @@ public void testTruncateStringMax() {
            "Test input with multiple 4 byte UTF-8 character where the first unicode character should be incremented")
        .isEqualTo(0);
  }
+
+  @Test
+  public void testTruncateStringMaxUpperBound() {


Could we add these to the test above? I'm also fine if there is a specific reason to have them somewhere else but it seems like these would fit into the test above as just other examples. The test cases for +1 and MAX_CODE_POINT are there already right?

Moved this to testTruncateStringMax

The test cases for +1 and MAX_CODE_POINT are there already right?
Yes, there is already one test case for MAX_CODE_POINT

RussellSpitzer

I just have a few nits on this, but I think this makes sense to me

zhongyujiang

@RussellSpitzer Thanks for reviewing, I've updated the PR to resolve your comments, please take a look when you have time.

zhongyujiang · 2024-09-23T04:39:33Z

core/src/test/java/org/apache/iceberg/TestMetricsTruncation.java

@@ -202,11 +202,20 @@ public void testTruncateStringMax() {
    String test5 = "\uDBFF\uDFFF\uDBFF\uDFFF";
    String test6 = "\uD800\uDFFF\uD800\uDFFF";
    // Increment the previous character
-    String test6_2_expected = "\uD801\uDC00";
+    String test6_1_expected = "\uD801\uDC00";


I think this should be a typo: "\uD800\uDFFF" is a Unicode surrogate pair, so its length is 1 code point.
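The length claim can be checked directly with JDK string methods. This is a small sketch with a hypothetical class and helper name; it also shows that incrementing the single code point of that pair gives the expected value from the diff.

```java
class SurrogatePairLength {
  // Hypothetical helper: number of code points in the whole string.
  static int codePoints(String s) {
    return s.codePointCount(0, s.length());
  }

  public static void main(String[] args) {
    String pair = "\uD800\uDFFF"; // one supplementary code point, U+103FF

    System.out.println(pair.length());    // 2 (UTF-16 code units)
    System.out.println(codePoints(pair)); // 1 (code point)

    // Incrementing that code point yields U+10400, i.e. "\uD801\uDC00".
    String next = new String(Character.toChars(pair.codePointAt(0) + 1));
    System.out.println(next.equals("\uD801\uDC00")); // true
  }
}
```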

I'm confused on this one, why is this one changed? Why did the test pass before?

zhongyujiang · 2024-09-23T04:44:36Z

core/src/test/java/org/apache/iceberg/TestMetricsTruncation.java

    String test7 = "\uD83D\uDE02\uD83D\uDE02\uD83D\uDE02";
    String test7_2_expected = "\uD83D\uDE02\uD83D\uDE03";
    String test7_1_expected = "\uD83D\uDE03";

+    // Increment the max UTF-8 character will overflow
+    String test8 = "a\uDBFF\uDFFFc";
+    String test8_2_expected = "b";


The characters in test5 are all MAX_CODE_POINT, so the upper bound does not exist.

test8 adds a case where an overflow occurs due to MAX_CODE_POINT, but it is possible to increment the previous character to get an upper bound.
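The difference between test5 and test8 can be illustrated with a truncation routine that carries the increment to the previous code point on overflow. This is a sketch under assumptions, not the UnicodeUtil implementation: `truncateMax`, `nextScalarValue`, the code-point-based length, and the `null` result for "no upper bound" are all hypothetical choices made for illustration.

```java
class TruncateMaxSketch {
  // Hypothetical helper: next Unicode scalar value, skipping surrogates;
  // -1 when codePoint is already Character.MAX_CODE_POINT.
  static int nextScalarValue(int codePoint) {
    if (codePoint >= Character.MAX_CODE_POINT) {
      return -1;
    }
    int next = codePoint + 1;
    return Character.isSurrogate((char) next) && next <= Character.MAX_SURROGATE
        ? Character.MAX_SURROGATE + 1
        : next;
  }

  // Truncates input to `length` code points and increments the last code
  // point that can be incremented, dropping everything after it.
  // Returns null when no such code point exists (no upper bound).
  static String truncateMax(String input, int length) {
    if (input.codePointCount(0, input.length()) <= length) {
      return input; // nothing to truncate
    }
    int end = input.offsetByCodePoints(0, length);
    StringBuilder sb = new StringBuilder(input.substring(0, end));
    // Walk backwards one code point at a time.
    for (int i = sb.length(); i > 0; ) {
      int cp = sb.codePointBefore(i);
      int start = i - Character.charCount(cp);
      int next = nextScalarValue(cp);
      if (next != -1) {
        sb.setLength(start);
        sb.appendCodePoint(next);
        return sb.toString();
      }
      i = start; // overflow: carry to the previous code point
    }
    return null;
  }

  public static void main(String[] args) {
    // test8: the second code point overflows, so 'a' becomes 'b'.
    System.out.println("b".equals(truncateMax("a\uDBFF\uDFFFc", 2)));        // true
    // test5: every code point is MAX_CODE_POINT, so there is no upper bound.
    System.out.println(truncateMax("\uDBFF\uDFFF\uDBFF\uDFFF", 1) == null);  // true
  }
}
```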

nastra · 2024-09-23T06:04:25Z

core/src/test/java/org/apache/iceberg/TestMetricsTruncation.java

+        .as("Truncated upper bound should be greater than or equal to the actual upper bound")
+        .isGreaterThanOrEqualTo(0);
+
+    assertThat(cmp.compare(truncateStringMax(Literal.of(test9), 2).value(), test9_2_expected))


this is missing an actual assertion check

zhongyujiang · 2024-10-05T10:04:24Z

Hey @RussellSpitzer @nastra could you please review this again when you have time?

Core: Fix UnicodeUtil#truncateStringMax returns malformed string.

0760a35

github-actions bot added API core labels Sep 18, 2024

Fix style.

06ea0b1

RussellSpitzer reviewed Sep 19, 2024

Comments.

baa2ed0

zhongyujiang commented Sep 23, 2024

nastra reviewed Sep 23, 2024

Fix assert.

a7b92ea

RussellSpitzer approved these changes Oct 7, 2024

RussellSpitzer merged commit 3220fad into apache:main Oct 7, 2024
50 checks passed

zhongyujiang deleted the dev/fix-metric-malformed branch October 8, 2024 12:57
