Are Whisper tokens actually guaranteed to be valid UTF-8 strings? #46

jcsoo · 2023-05-02T22:45:31Z

WhisperContext::token_to_str currently calls CStr::to_str, which returns an error if the contents of the CStr are not valid UTF-8. Is there any guarantee that individual Whisper tokens are actually valid UTF-8?

If not, it might be helpful to provide a variant of this function that would return the CStr so that the caller could decide what to do with the token (token_to_str_raw()? token_to_cstr()?).

Whisper could possibly generate a series of tokens that might individually be invalid UTF-8 but could be concatenated to produce a valid String. And, in cases where the resulting string is still not valid UTF-8, the caller may want to decide whether to fail or to use to_string_lossy().
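To illustrate the concern, here is a minimal sketch in plain Rust, independent of whisper-rs. The helper names `join_token_bytes`, `decode_strict`, and `decode_lossy` are hypothetical and are not part of any existing API. It assumes a multi-byte character such as "é" (bytes 0xC3 0xA9) can end up split across two tokens, so each token fails UTF-8 validation on its own while the concatenation succeeds.

```rust
/// Concatenates the raw bytes of several tokens into one buffer.
/// (Hypothetical helper, for illustration only.)
fn join_token_bytes(tokens: &[&[u8]]) -> Vec<u8> {
    tokens.concat()
}

/// Strict decoding: fails if the bytes are not valid UTF-8.
fn decode_strict(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    std::str::from_utf8(bytes)
}

/// Lossy decoding: invalid sequences become U+FFFD (the replacement character).
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

With tokens `b"caf"`, `[0xC3]`, and `[0xA9]`, strict decoding of the last two fails individually, but decoding the joined bytes yields "café". A caller holding the raw bytes can therefore validate once after joining and only then choose between failing and a lossy conversion.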

tazz4843 · 2023-05-04T19:24:03Z

I didn't double-check this; I just copied the approach from similar functions. It would be a good idea to verify. PRs are welcome.

jbg · 2023-05-19T23:58:21Z

They're not guaranteed to be valid UTF-8: avstack/gst-whisper#1

Also affects full_get_segment_text().

Rather than leaking the CStr FFI type in the API, maybe change the existing functions that return String to internally go CStr -> &[u8] -> (lossy) -> String so that they don't error out on bad UTF-8, and provide adjacent functions that return the bytes as a Vec<u8> so the caller can access the raw bytes if desired?
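A rough sketch of that conversion path, using only the standard library. The function names `cstr_to_string_lossy` and `cstr_to_bytes` are made up for illustration and do not reflect how whisper-rs actually implements or names anything.

```rust
use std::ffi::CStr;

/// CStr -> &[u8] -> lossy -> String.
/// Never errors; invalid UTF-8 sequences are replaced with U+FFFD.
fn cstr_to_string_lossy(raw: &CStr) -> String {
    // to_bytes() excludes the trailing NUL terminator.
    String::from_utf8_lossy(raw.to_bytes()).into_owned()
}

/// Adjacent accessor that hands the caller an owned copy of the raw bytes,
/// so the CStr type itself never appears in the public signature.
fn cstr_to_bytes(raw: &CStr) -> Vec<u8> {
    raw.to_bytes().to_vec()
}
```

For example, a C string holding the bytes `a`, `b`, `0xFF` would come back as "ab\u{FFFD}" from the lossy function and as the three original bytes from the byte accessor.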

jcsoo · 2023-05-24T00:48:35Z

Adding token_to_bytes and full_get_segment_bytes() is easy, and I agree that it would be better not to leak the CStr.

token_to_str is trickier because it currently attempts to return a &str. Doing a lossy conversion would require an allocation and returning a String. On the other hand, full_get_segment_text could be changed to do a lossy conversion without any problems.

Maybe token_to_str should be replaced by token_to_string to support lossy conversion?
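One possible middle ground, sketched here as an assumption rather than as what the crate does: `String::from_utf8_lossy` returns a `Cow<str>`, which borrows when the bytes are already valid UTF-8 and allocates only when a replacement is needed. The function name `bytes_to_str_lossy` is hypothetical.

```rust
use std::borrow::Cow;

/// Lossy conversion that avoids allocating in the common, valid case.
/// Returns Cow::Borrowed for valid UTF-8 and Cow::Owned otherwise.
fn bytes_to_str_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}
```

A caller that needs an owned String can call `.into_owned()` on the result, while one that only reads the text can use it as a `&str` through deref.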

tazz4843 · 2024-04-06T17:38:47Z

Should be solved in f4ea0d9

jbg mentioned this issue May 19, 2023

Whisper can produce invalid UTF-8 avstack/gst-whisper#1

Open

travolin mentioned this issue Aug 18, 2023

Add access to segment text as bytes #79

Merged

tazz4843 closed this as completed Apr 6, 2024
