feat(rust): support simd approach in converting utf16 to utf8 #1778

Open
wants to merge 2 commits into
base: main

urlyy
Contributor

@urlyy urlyy commented Jul 29, 2024

What does this PR do?

This PR adds a SIMD path for UTF-16 to UTF-8 conversion, built on the AVX, SSE, and NEON instruction sets. It builds on #1730 and includes benchmarks.

referencing

Notice:

  • I use two precomputed tables, the same approach as simdutf, but they take about 1600 lines.
  • I copied two UTF-8 encoded text files into the Rust project for the benchmark.
  • util.rs might need to be merged with string_util.rs.
  • The util.rs code might be too long for you. The algorithm first splits the UTF-16 input into chunks, then converts each 256-bit or 128-bit chunk to UTF-8 bytes in one step, without a loop (except in case 4). There are four cases:
    1. every UTF-16 code unit in the chunk maps to 1 UTF-8 byte
    2. every UTF-16 code unit in the chunk maps to 1 or 2 UTF-8 bytes
    3. every UTF-16 code unit in the chunk maps to 1, 2, or 3 UTF-8 bytes
    4. every UTF-16 code unit in the chunk maps to 1, 2, or 3 UTF-8 bytes, or a pair of UTF-16 code units maps to 4 UTF-8 bytes
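The four cases above can be sketched in scalar form. This is only an illustration of the classification, not the PR's SIMD code: `ChunkCase` and `classify_chunk` are hypothetical names, and a real implementation would make the same decision with a few vector comparisons instead of iterating.

```rust
/// Which of the four conversion paths a chunk of UTF-16 code units needs.
/// (Hypothetical type, introduced only for this sketch.)
#[derive(Debug, PartialEq, Eq)]
enum ChunkCase {
    /// Every unit is below 0x80: one UTF-8 byte each.
    OneByte,
    /// Every unit is below 0x800: one or two UTF-8 bytes each.
    UpToTwoBytes,
    /// No surrogates present: one to three UTF-8 bytes each.
    UpToThreeBytes,
    /// At least one surrogate (0xD800..=0xDFFF): pairs become four bytes.
    HasSurrogates,
}

/// Scalar stand-in for the per-chunk check that selects a conversion path.
fn classify_chunk(chunk: &[u16]) -> ChunkCase {
    if chunk.iter().any(|&u| (0xD800..=0xDFFF).contains(&u)) {
        ChunkCase::HasSurrogates
    } else if chunk.iter().all(|&u| u < 0x80) {
        ChunkCase::OneByte
    } else if chunk.iter().all(|&u| u < 0x800) {
        ChunkCase::UpToTwoBytes
    } else {
        ChunkCase::UpToThreeBytes
    }
}
```

For example, `classify_chunk(&[0x41, 0x4E2D])` selects `ChunkCase::UpToThreeBytes`, while any chunk containing `0xD83D` falls into `ChunkCase::HasSurrogates`.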

Example:
First, we use SIMD bitwise operations to convert 0x[ 00ab pqrs ] -> 0x[ 00ab qwer ]. Assume that 00ab is a UTF-16 code unit that maps to 1 UTF-8 byte and pqrs is one that maps to 2 UTF-8 bytes.
Then we remove the unneeded 00 with a shuffle. With array = table[idx][1...] = [1,2,3,0] (illustrative values, not real data), we convert 0x[ 00ab qwer ] --simd_shuffle_func--> 0x[abqw erxx]. Because final_length = table[idx][0] = 3, we advance len by 3, not 4. Although we actually store 0x[abqw erxx], the pre-allocated buffer ensures the write stays in bounds.
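The expand-then-shuffle idea in this example can be imitated without SIMD. The sketch below is a scalar model under stated assumptions: it handles a 4-unit chunk whose units are all below 0x800, and it builds the table row at run time, whereas the PR uses large precomputed tables. The names `expand_slot`, `table_row`, and `compact_chunk` are hypothetical.

```rust
/// Expand one code unit below 0x800 into a fixed two-byte slot.
/// ASCII units leave the second byte as padding (0).
fn expand_slot(u: u16) -> [u8; 2] {
    if u < 0x80 {
        [u as u8, 0]
    } else {
        [0xC0 | (u >> 6) as u8, 0x80 | (u & 0x3F) as u8]
    }
}

/// Build one table row for a 4-unit chunk. row[0] is the output length and
/// row[1..] are shuffle indices into the 8 slot bytes. Bit i of `ascii_mask`
/// is set when unit i is ASCII, so its padding byte must be dropped.
fn table_row(ascii_mask: u8) -> [u8; 9] {
    let mut row = [0u8; 9];
    let mut len = 0usize;
    for i in 0..4 {
        row[1 + len] = (2 * i) as u8;
        len += 1;
        if ascii_mask & (1 << i) == 0 {
            row[1 + len] = (2 * i + 1) as u8;
            len += 1;
        }
    }
    row[0] = len as u8;
    row
}

/// Convert 4 units (each below 0x800) and return the 8 stored bytes plus
/// the number of bytes that are actually valid.
fn compact_chunk(units: [u16; 4]) -> ([u8; 8], usize) {
    let mut slots = [0u8; 8];
    let mut mask = 0u8;
    for (i, &u) in units.iter().enumerate() {
        slots[2 * i..2 * i + 2].copy_from_slice(&expand_slot(u));
        if u < 0x80 {
            mask |= 1 << i;
        }
    }
    let row = table_row(mask);
    let mut out = [0u8; 8];
    // Scalar stand-in for the SIMD shuffle: all 8 bytes are written, but
    // only the first row[0] of them are meaningful.
    for j in 0..8 {
        out[j] = slots[row[1 + j] as usize];
    }
    (out, row[0] as usize)
}
```

As in the description, the full 8 bytes are always stored and the caller advances its output offset only by the returned length, so the trailing bytes are overwritten by the next chunk.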

Related issues

Does this PR introduce any user-facing change?

  • Does this PR introduce any public API change?
  • Does this PR introduce any binary protocol compatibility change?

Benchmark

dataset from https://github.com/lemire/unicode_lipsum/tree/main/wikipedia_mars

Both the SIMD and non-SIMD approaches are faster than String::from_utf16(bytes). In the benchmark on my Windows 11 x86 machine, the SIMD approach is only a little faster than the normal approach, which is less than I expected. AVX seems better than SSE because AVX handles 256 bits at a time while SSE handles only 128 bits. When a chunk contains a surrogate pair, the algorithm uses the fallback (normal, non-SIMD) path, so in that case the SIMD approach might be slower than the normal one.
image
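For readers unfamiliar with the fallback path mentioned above, here is a minimal scalar sketch of UTF-16 to UTF-8 conversion that handles surrogate pairs. It is not the PR's `utf16_to_utf8_fallback`; the function name `utf16_to_utf8_scalar` and its `Vec`-returning signature are assumptions made for illustration, and endianness handling is left out.

```rust
/// Convert native-endian UTF-16 code units to UTF-8 bytes one unit at a time.
/// Returns an error message for unpaired surrogates.
fn utf16_to_utf8_scalar(utf16: &[u16]) -> Result<Vec<u8>, String> {
    // Three bytes per unit is enough: a surrogate pair (2 units) needs 4.
    let mut out = Vec::with_capacity(utf16.len() * 3);
    let mut i = 0;
    while i < utf16.len() {
        let u = utf16[i] as u32;
        let cp = match u {
            0xD800..=0xDBFF => {
                // High surrogate: the next unit must be a low surrogate.
                let low = match utf16.get(i + 1) {
                    Some(&l) if (0xDC00..=0xDFFF).contains(&l) => l as u32,
                    _ => return Err(format!("unpaired high surrogate at index {}", i)),
                };
                i += 1;
                0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00)
            }
            0xDC00..=0xDFFF => {
                return Err(format!("unpaired low surrogate at index {}", i));
            }
            _ => u,
        };
        i += 1;
        match cp {
            0..=0x7F => out.push(cp as u8),
            0x80..=0x7FF => {
                out.push(0xC0 | (cp >> 6) as u8);
                out.push(0x80 | (cp & 0x3F) as u8);
            }
            0x800..=0xFFFF => {
                out.push(0xE0 | (cp >> 12) as u8);
                out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
                out.push(0x80 | (cp & 0x3F) as u8);
            }
            _ => {
                out.push(0xF0 | (cp >> 18) as u8);
                out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
                out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
                out.push(0x80 | (cp & 0x3F) as u8);
            }
        }
    }
    Ok(out)
}
```

A loop of this shape runs whenever a chunk contains a surrogate, which is one plausible reason why emoji-heavy input would see little benefit from the SIMD path.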

@chaokunyang
Collaborator

This is too huge, do we have a better way to implement it @theweipeng @kitty-eu-org

let current_dir = env::current_dir()
.expect("Failed to get current directory")
.join("benches");
let path1 = current_dir.join("chinese.utf8.txt");
Collaborator

It seems this is only used here. Could we use smaller test data? I think two or three lines of string literals are enough for the benchmark.

Contributor Author

OK, I'll replace it with some randomly generated strings.
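One way to generate such input without checking text files into the repository is a small seeded generator. This is a hedged sketch, not what the PR ended up using: `XorShift64`, `random_utf16`, and the chosen code point ranges are all assumptions, picked so that the output exercises the 1-, 2-, 3-, and 4-byte UTF-8 cases.

```rust
/// Tiny deterministic xorshift generator, so benchmarks are reproducible
/// without an external crate. (Hypothetical helper for this sketch.)
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Generate `len_chars` random characters as UTF-16 code units.
fn random_utf16(len_chars: usize, seed: u64) -> Vec<u16> {
    // ASCII, 2-byte range, CJK (3 bytes), emoji (surrogate pairs, 4 bytes).
    const RANGES: [(u32, u32); 4] = [
        (0x20, 0x7E),
        (0xA0, 0x7FF),
        (0x4E00, 0x9FFF),
        (0x1F600, 0x1F64F),
    ];
    // xorshift must not start from a zero state.
    let mut rng = XorShift64(seed.max(1));
    let mut s = String::new();
    for _ in 0..len_chars {
        let (lo, hi) = RANGES[(rng.next_u64() % 4) as usize];
        let cp = lo + (rng.next_u64() % (hi - lo + 1) as u64) as u32;
        // None of the ranges overlaps the surrogate block, so this is Some.
        s.push(char::from_u32(cp).expect("ranges avoid surrogates"));
    }
    s.encode_utf16().collect()
}
```

Because the seed is fixed, the same input can be fed to every implementation under comparison, for example `random_utf16(10_000, 42)`.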

@urlyy
Contributor Author

urlyy commented Jul 30, 2024

This is too huge, do we have a better way to implement it @theweipeng @kitty-eu-org

I have thought about it too. I have seen the C++ implementation in fury: https://github.com/apache/fury/pull/1732/files , but it seems to use the SIMD API only to swap endianness and to check whether every UTF-16 code unit in a chunk converts to 1 or 4 UTF-8 bytes; it then converts the code units of the chunk one by one in a loop. I don't think that makes good use of SIMD. (This is not blame, just some confusion on my part, since the C++ benchmark does show it being 1x faster than the std library.)

// cpp version
if (_mm256_testz_si256(mask1, mask1)) {
  // All values < 0x80, 1 byte per character
  for (int j = 0; j < 16; ++j) {
    *output++ = static_cast<char>(utf16[i + j]);
  }
}

rust version

// Check that every u16 in the chunk is less than 0x80
if _mm256_testz_si256(chunk, *M_FF80) != 0 {
    let utf8_packed = _mm_packus_epi16(
        _mm256_castsi256_si128(chunk),
        _mm256_extractf128_si256(chunk, 1),
    );
    _mm_storeu_si128(ptr_8.add(offset_8) as *mut __m128i, utf8_packed);
    offset_8 += CHUNK_UTF16_USAGE;
    offset_16 += CHUNK_UTF16_USAGE;
    continue;
}

It is easy to shorten the code: just remove some functions. I wrote a short version using AVX, shown below, followed by its benchmark data.

#[target_feature(enable = "avx", enable = "avx2", enable = "sse2")]
pub unsafe fn utf16_to_utf8_only_check(
    utf16: &[u16],
    is_little_endian: bool,
) -> Result<String, String> {
    let mut utf8_bytes: Vec<u8> = Vec::with_capacity(utf16.len() * 3);
    let ptr_8 = utf8_bytes.as_mut_ptr();
    let ptr_16 = utf16.as_ptr();
    let mut offset_8 = 0;
    let mut offset_16 = 0;
    let len_16 = utf16.len();
    while offset_16 + CHUNK_UTF16_USAGE <= len_16 {
        let mut chunk = _mm256_loadu_si256(ptr_16.add(offset_16) as *const __m256i);
        chunk = if is_little_endian == super::super::IS_LITTLE_ENDIAN_LOCAL {
            chunk
        } else {
            _mm256_shuffle_epi8(chunk, *ENDIAN_SWAP_MASK)
        };
        let mask1 = _mm256_cmpgt_epi16(chunk, *limit1);
        // Check that every u16 in the chunk is less than 0x80 (1 UTF-16 unit -> 1 UTF-8 byte)
        if _mm256_testz_si256(mask1, mask1) != 0 {
            let utf8_packed = _mm_packus_epi16(
                _mm256_castsi256_si128(chunk),
                _mm256_extractf128_si256(chunk, 1),
            );
            _mm_storeu_si128(ptr_8.add(offset_8) as *mut __m128i, utf8_packed);
            offset_8 += CHUNK_UTF16_USAGE;
            offset_16 += CHUNK_UTF16_USAGE;
            continue;
        }
        // Some UTF-16 units in the chunk need 2, 3, or 4 UTF-8 bytes: use the fallback
        let res = call_fallback(
            ptr_16,
            ptr_8,
            &mut offset_16,
            &mut offset_8,
            len_16,
            is_little_endian,
        );
        if let Some(err_msg) = res {
            return Err(err_msg);
        }
    }
    // Handle the remaining u16 values that are too few to form a chunk.
    if offset_16 < len_16 {
        let suffix_utf16 =
            std::slice::from_raw_parts(ptr_16.add(offset_16), len_16 - offset_16);
        // Propagate a fallback error with `?`, otherwise count the bytes written.
        let written = super::super::utf16_to_utf8_fallback(
            suffix_utf16,
            ptr_8.add(offset_8),
            is_little_endian,
        )?;
        offset_8 += written;
    }
    utf8_bytes.set_len(offset_8);
    Ok(String::from_utf8(utf8_bytes).unwrap())
}
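One detail may be worth checking when comparing this short version with the earlier `_mm256_testz_si256(chunk, *M_FF80)` check: `_mm256_cmpgt_epi16` compares lanes as signed 16-bit integers. The value of `limit1` is not shown in the snippet, so the sketch below assumes it holds 0x7F in every lane. Under that assumption, code units of 0x8000 and above look negative and pass the "all ASCII" test. The scalar functions `signed_gt` and `mask_test` are hypothetical per-lane models of the two checks.

```rust
/// Per-lane model of `_mm256_cmpgt_epi16(chunk, limit)`: a signed comparison.
fn signed_gt(unit: u16, limit: i16) -> bool {
    (unit as i16) > limit
}

/// Per-lane model of testing the unit against the mask 0xFF80:
/// true when any bit above the low 7 bits is set, i.e. the unit is not ASCII.
fn mask_test(unit: u16) -> bool {
    unit & 0xFF80 != 0
}

/// True when the two checks disagree about whether `unit` is non-ASCII.
fn checks_disagree(unit: u16) -> bool {
    signed_gt(unit, 0x7F) != mask_test(unit)
}
```

With a limit of 0x7F, the two models agree for every unit below 0x8000 and disagree for every unit from 0x8000 upward, such as the CJK character U+9FA5. If `limit1` is defined differently in the PR, this concern may not apply.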

image

By the way, I fully support shortening the code. This huge code is too ugly for fury.😄

@kitty-eu-org
Contributor

kitty-eu-org commented Jul 30, 2024

@urlyy I wrote a Rust version. With NEON, performance improves by about 2x. We can optimize it together.

image

my "Standard" is your normal

@kitty-eu-org
Contributor

kitty-eu-org commented Jul 30, 2024

@urlyy I'm optimizing SIMD code for SSE2 and AVX

@kitty-eu-org
Contributor

@urlyy neon I referred to part of your implementation, thank you

@urlyy
Contributor Author

urlyy commented Jul 30, 2024

@urlyy neon I referred to part of your implementation, thank you

I'm new to SIMD, and my implementation is based on simdutf. I'd love to see further optimizations for this part of the code and learn something new from you🤓
