feat(rust): support simd approach in converting utf16 to utf8 #1778

urlyy · 2024-07-29T17:21:00Z

What does this PR do?

This PR adds a SIMD implementation of UTF-16 to UTF-8 conversion, based on the AVX, SSE, and NEON instruction sets and building on #1730, together with benchmarks.

referencing

Notes:

I use two precomputed lookup tables, the same approach simdutf takes, but they take up about 1600 lines.
I copied two UTF-8 encoded text files into the Rust project for the benchmark.
util.rs might need to be merged with string_util.rs.
The util.rs code might be too long for you. The algorithm first splits the UTF-16 input into chunks, then converts each 256-bit or 128-bit chunk to UTF-8 bytes in one step, without a loop (except in case 4). There are four cases:
1. every UTF-16 code unit in the chunk converts to 1 UTF-8 byte
2. every UTF-16 code unit in the chunk converts to 1 or 2 UTF-8 bytes
3. every UTF-16 code unit in the chunk converts to 1, 2, or 3 UTF-8 bytes
4. every UTF-16 code unit in the chunk converts to 1, 2, or 3 UTF-8 bytes, or 2 code units (a surrogate pair) convert to 4 UTF-8 bytes
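The four cases above can be sketched in scalar Rust as a classification step. This is only an illustration of the idea, not the PR's code: the `ChunkCase` enum and `classify_chunk` function are hypothetical names, and the actual implementation would make these checks with SIMD comparisons over a whole register rather than with iterators.

```rust
/// Hypothetical labels for the four cases described above.
#[derive(Debug, PartialEq, Eq)]
enum ChunkCase {
    /// Case 1: every code unit is below 0x80 (1 UTF-8 byte each).
    OneByte,
    /// Case 2: every code unit is below 0x800 (1 or 2 UTF-8 bytes each).
    UpToTwoBytes,
    /// Case 3: no surrogates (1, 2, or 3 UTF-8 bytes each).
    UpToThreeBytes,
    /// Case 4: at least one surrogate, so pairs may become 4 UTF-8 bytes.
    HasSurrogates,
}

/// Scalar stand-in for the per-chunk SIMD checks (assumption: the chunk
/// is already in native endianness).
fn classify_chunk(chunk: &[u16]) -> ChunkCase {
    if chunk.iter().all(|&u| u < 0x80) {
        ChunkCase::OneByte
    } else if chunk.iter().all(|&u| u < 0x800) {
        ChunkCase::UpToTwoBytes
    } else if chunk.iter().all(|&u| !(0xD800..=0xDFFF).contains(&u)) {
        ChunkCase::UpToThreeBytes
    } else {
        ChunkCase::HasSurrogates
    }
}
```

The checks are ordered from cheapest to most general, so a chunk of pure ASCII never reaches the surrogate test.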

Example:
First we use the bitwise operations available in SIMD to convert 0x[ 00ab pqrs ] -> 0x[ 00ab qwer ]. Assume that 00ab is a code unit that converts to 1 UTF-8 byte and pqrs is a code unit that converts to 2 UTF-8 bytes.
Then we remove the unneeded 00 with a shuffle: array = table[idx][1...] = [1,2,3,0] (illustrative values, not the real table data), so 0x[ 00ab qwer ] --simd_shuffle_func--> 0x[abqw erxx]. Because final_length = table[idx][0] = 3, we set len = len + 3, not 4. Although we actually store all of 0x[abqw erxx], the pre-allocated buffer ensures the extra byte never goes out of bounds.
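The example above can be emulated in scalar Rust to make the table lookup and the over-long store concrete. This is a toy sketch under several assumptions: it handles only two code units below 0x800 at a time, the row layout (length first, then shuffle indices) simply mirrors the example, and `expand`, `build_row`, `shuffle`, and `convert_pair` are hypothetical names. The real tables in the PR and in simdutf are precomputed and cover a full register.

```rust
/// Expand two code units (< 0x800, assumed) into two 2-byte lanes.
/// An ASCII unit becomes [unused 0x00, byte]; a 2-byte unit becomes
/// [lead, trail]. The returned mask has bit i set when unit i is ASCII.
fn expand(units: [u16; 2]) -> ([u8; 4], u8) {
    let mut lanes = [0u8; 4];
    let mut mask = 0u8;
    for (i, &u) in units.iter().enumerate() {
        if u < 0x80 {
            lanes[2 * i + 1] = u as u8;
            mask |= 1 << i;
        } else {
            lanes[2 * i] = 0xC0 | (u >> 6) as u8;
            lanes[2 * i + 1] = 0x80 | (u & 0x3F) as u8;
        }
    }
    (lanes, mask)
}

/// Build one table row: row[0] is the useful output length,
/// row[1..] are the shuffle indices (unused slots stay 0).
fn build_row(ascii_mask: u8) -> [u8; 5] {
    let mut row = [0u8; 5];
    let mut len = 0usize;
    for unit in 0..2usize {
        if ascii_mask & (1 << unit) == 0 {
            row[1 + len] = (unit * 2) as u8; // keep the lead byte
            len += 1;
        }
        row[1 + len] = (unit * 2 + 1) as u8; // always keep the second byte
        len += 1;
    }
    row[0] = len as u8;
    row
}

/// Scalar stand-in for a byte shuffle instruction.
fn shuffle(input: [u8; 4], idx: &[u8]) -> [u8; 4] {
    let mut out = [0u8; 4];
    for (o, &i) in out.iter_mut().zip(idx) {
        *o = input[i as usize];
    }
    out
}

/// Store all 4 shuffled bytes but advance only by the useful length.
/// The caller must leave at least 4 bytes of room at `offset`.
fn convert_pair(units: [u16; 2], buf: &mut [u8], offset: usize) -> usize {
    let (lanes, mask) = expand(units);
    let row = build_row(mask);
    let packed = shuffle(lanes, &row[1..]);
    buf[offset..offset + 4].copy_from_slice(&packed);
    offset + row[0] as usize
}
```

For the input `['a', 'é']` the row is `[3, 1, 2, 3, 0]`, which matches the `[1,2,3,0]` indices and length 3 used in the example: four bytes are written, but only three count toward the output length.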

Related issues

Does this PR introduce any user-facing change?

Does this PR introduce any public API change?
Does this PR introduce any binary protocol compatibility change?

Benchmark

dataset from https://github.com/lemire/unicode_lipsum/tree/main/wikipedia_mars

Both the SIMD and non-SIMD approaches are faster than String::from_utf16(bytes). In the benchmark on my Windows 11 x86 machine, the SIMD approach is only slightly faster than the normal approach, which is less than I expected. AVX seems better than SSE because AVX handles 256 bits at a time while SSE only handles 128 bits. When it meets a surrogate pair, the algorithm falls back to the normal (non-SIMD) path, so in that case the SIMD approach might be slower than the normal one.
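For reference, a scalar conversion that also handles surrogate pairs might look like the sketch below. It is an assumption about what a non-SIMD path does in general, not the PR's fallback: `utf16_to_utf8_scalar` is a hypothetical name, the input is assumed to be in native endianness, and the error messages are made up for illustration.

```rust
/// Scalar UTF-16 -> UTF-8 conversion, one code point at a time.
fn utf16_to_utf8_scalar(utf16: &[u16]) -> Result<Vec<u8>, String> {
    // Three bytes per code unit is always enough: a surrogate pair is
    // two units but produces only four bytes.
    let mut out = Vec::with_capacity(utf16.len() * 3);
    let mut i = 0;
    while i < utf16.len() {
        let u = utf16[i] as u32;
        match u {
            0x0000..=0x007F => out.push(u as u8),
            0x0080..=0x07FF => {
                out.push(0xC0 | (u >> 6) as u8);
                out.push(0x80 | (u & 0x3F) as u8);
            }
            0xD800..=0xDBFF => {
                // A high surrogate must be followed by a low surrogate.
                let lo = match utf16.get(i + 1) {
                    Some(&lo) if (0xDC00..=0xDFFF).contains(&lo) => lo as u32,
                    _ => return Err(format!("unpaired high surrogate at index {i}")),
                };
                let cp = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
                out.push(0xF0 | (cp >> 18) as u8);
                out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
                out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
                out.push(0x80 | (cp & 0x3F) as u8);
                i += 1; // the low surrogate is consumed as well
            }
            0xDC00..=0xDFFF => {
                return Err(format!("unpaired low surrogate at index {i}"));
            }
            _ => {
                out.push(0xE0 | (u >> 12) as u8);
                out.push(0x80 | ((u >> 6) & 0x3F) as u8);
                out.push(0x80 | (u & 0x3F) as u8);
            }
        }
        i += 1;
    }
    Ok(out)
}
```

The surrogate branch is the only one that looks at two code units at once, which is why a fixed-width SIMD chunk cannot treat it like the other three cases.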

chaokunyang · 2024-07-30T03:57:47Z

This is too large. Do we have a better way to implement it? @theweipeng @kitty-eu-org

chaokunyang · 2024-07-30T04:00:44Z

rust/fury/benches/simd_utf16_to_utf8.rs

+    let current_dir = env::current_dir()
+        .expect("Failed to get current directory")
+        .join("benches");
+    let path1 = current_dir.join("chinese.utf8.txt");


It seems this is only used here. Could we use smaller test data? I think two or three string literals are enough for the benchmark.

OK, I'll replace it with some randomly generated strings.

urlyy · 2024-07-30T05:20:12Z

This is too large. Do we have a better way to implement it? @theweipeng @kitty-eu-org

I have thought about that too. I looked at the C++ implementation in Fury (https://github.com/apache/fury/pull/1732/files), but there the SIMD API seems to be used only for swapping endianness and for checking whether every UTF-16 code unit in a chunk converts to 1 or 4 UTF-8 bytes; the code units in the chunk are then converted one by one in a loop. I don't think that makes good use of SIMD. (This is not blame, just some confusion on my part, since the C++ benchmark shows it really is about 1x faster than the standard library.)

// cpp version
if (_mm256_testz_si256(mask1, mask1)) {
      // All values < 0x80, 1 byte per character
      for (int j = 0; j < 16; ++j) {
        *output++ = static_cast<char>(utf16[i + j]);
      }
    }

rust version

// check that every u16 in the chunk is less than 0x80
if _mm256_testz_si256(chunk, *M_FF80) != 0 {
    let utf8_packed = _mm_packus_epi16(
        _mm256_castsi256_si128(chunk),
        _mm256_extractf128_si256(chunk, 1),
    );
    _mm_storeu_si128(ptr_8.add(offset_8) as *mut __m128i, utf8_packed);
    offset_8 += CHUNK_UTF16_USAGE;
    offset_16 += CHUNK_UTF16_USAGE;
    continue;
}

It is easy to shorten the code: just remove some functions. I wrote a short version using AVX, shown below, and its benchmark data follows.

#[target_feature(enable = "avx", enable = "avx2", enable = "sse2")]
pub unsafe fn utf16_to_utf8_only_check(
    utf16: &[u16],
    is_little_endian: bool,
) -> Result<String, String> {
    let mut utf8_bytes: Vec<u8> = Vec::with_capacity(utf16.len() * 3);
    let ptr_8 = utf8_bytes.as_mut_ptr();
    let ptr_16 = utf16.as_ptr();
    let mut offset_8 = 0;
    let mut offset_16 = 0;
    let len_16 = utf16.len();
    while offset_16 + CHUNK_UTF16_USAGE <= len_16 {
        let mut chunk = _mm256_loadu_si256(ptr_16.add(offset_16) as *const __m256i);
        chunk = if is_little_endian == super::super::IS_LITTLE_ENDIAN_LOCAL {
            chunk
        } else {
            _mm256_shuffle_epi8(chunk, *ENDIAN_SWAP_MASK)
        };
        let mask1 = _mm256_cmpgt_epi16(chunk, *limit1);
        // fast path: every u16 in the chunk is less than 0x80, so 1 UTF-16 code unit -> 1 UTF-8 byte
        if _mm256_testz_si256(mask1, mask1) != 0 {
            let utf8_packed = _mm_packus_epi16(
                _mm256_castsi256_si128(chunk),
                _mm256_extractf128_si256(chunk, 1),
            );
            _mm_storeu_si128(ptr_8.add(offset_8) as *mut __m128i, utf8_packed);
            offset_8 += CHUNK_UTF16_USAGE;
            offset_16 += CHUNK_UTF16_USAGE;
            continue;
        }
        // otherwise some code units need 2, 3, or 4 UTF-8 bytes, so use the fallback
        let res = call_fallback(
            ptr_16,
            ptr_8,
            &mut offset_16,
            &mut offset_8,
            len_16,
            is_little_endian,
        );
        if let Some(err_msg) = res {
            return Err(err_msg);
        }
    }
    // handle the remaining u16 values that do not fill a whole chunk
    if offset_16 < len_16 {
        let suffix_utf16 =
            std::slice::from_raw_parts(ptr_16.add(offset_16), len_16 - offset_16);
        let res = super::super::utf16_to_utf8_fallback(
            suffix_utf16,
            ptr_8.add(offset_8),
            is_little_endian,
        );
        // propagate a conversion error, otherwise advance by the number of bytes written
        offset_8 += res?;
    }
    utf8_bytes.set_len(offset_8);
    Ok(String::from_utf8(utf8_bytes).unwrap())
}
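The control flow of the short version above (an all-ASCII fast path per chunk, otherwise a fallback, then a tail) can be sketched in safe, portable Rust without intrinsics. This is an illustration under assumptions, not a replacement for the AVX code: `CHUNK`, `push_one`, and `utf16_to_utf8_chunked` are hypothetical names, the input is assumed to be in native endianness, and the fallback here decodes a single code point with `std::char::decode_utf16` before re-checking the fast path, which may differ from what `call_fallback` does.

```rust
/// Code units per chunk; 16 matches one 256-bit register (assumed).
const CHUNK: usize = 16;

/// Decode one code point from the front of `units`, append its UTF-8
/// bytes to `out`, and return how many code units were consumed.
fn push_one(units: &[u16], out: &mut Vec<u8>) -> Result<usize, String> {
    // Looking at no more than two units is enough for one code point.
    let head = &units[..units.len().min(2)];
    match std::char::decode_utf16(head.iter().copied()).next() {
        Some(Ok(c)) => {
            let mut buf = [0u8; 4];
            out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
            Ok(c.len_utf16())
        }
        Some(Err(e)) => Err(format!("invalid UTF-16: {e}")),
        None => Err("empty input".to_string()),
    }
}

fn utf16_to_utf8_chunked(utf16: &[u16]) -> Result<String, String> {
    let mut out: Vec<u8> = Vec::with_capacity(utf16.len() * 3);
    let mut i = 0;
    while i < utf16.len() {
        // Fast path: a full chunk in which every code unit is ASCII.
        if i + CHUNK <= utf16.len() && utf16[i..i + CHUNK].iter().all(|&u| u < 0x80) {
            out.extend(utf16[i..i + CHUNK].iter().map(|&u| u as u8));
            i += CHUNK;
            continue;
        }
        // Fallback and tail: one code point at a time.
        i += push_one(&utf16[i..], &mut out)?;
    }
    String::from_utf8(out).map_err(|e| e.to_string())
}
```

Because the fallback advances by a variable number of code units, a surrogate pair that straddles a chunk boundary is still decoded as one code point.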

By the way, I fully support shortening the code. This huge code is too ugly for Fury. 😄

kitty-eu-org · 2024-07-30T10:33:36Z

@urlyy I wrote a Rust version, and the NEON path shows roughly a 2x performance improvement. We can optimize it together.

My "Standard" is your "normal".

kitty-eu-org · 2024-07-30T10:41:31Z

@urlyy I'm optimizing SIMD code for SSE2 and AVX

kitty-eu-org · 2024-07-30T10:44:19Z

@urlyy For NEON, I referred to part of your implementation. Thank you.

urlyy · 2024-07-30T10:49:38Z

@urlyy For NEON, I referred to part of your implementation. Thank you.

I'm new to SIMD, and my implementation is based on simdutf. I'd love to see further optimizations for this part of the code and learn something new from you. 🤓

urlyy added 2 commits July 30, 2024 01:01

feat(rust): support simd utf16 to utf8

181084d

feat(rust): support simd approach for utf16 to utf8

d32b799

urlyy requested review from theweipeng and chaokunyang as code owners July 29, 2024 17:21

chaokunyang reviewed Jul 30, 2024
