Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Are Whisper tokens actually guaranteed to be valid UTF-8 strings? #46

Closed
jcsoo opened this issue May 2, 2023 · 4 comments
Closed

Are Whisper tokens actually guaranteed to be valid UTF-8 strings? #46

jcsoo opened this issue May 2, 2023 · 4 comments

Comments

@jcsoo
Copy link
Contributor

jcsoo commented May 2, 2023

WhisperContext::token_to_str currently calls CStr::to_str which will return an error if the contents of the CStr are not valid UTF-8. Is there any guarantee that individual Whisper tokens are actually UTF-8?

If not, it might be helpful to provide a variant of this function that would return the CStr so that the caller could decide what to do with the token (token_to_str_raw()? token_to_cstr()?).

Whisper could possibly generate a series of tokens that might individually be invalid UTF-8 but could be concatenated to produce a valid String. And, in cases where the resulting string is still not valid UTF-8, the caller may want to decide whether to fail or to use to_string_lossy().

@tazz4843
Copy link
Owner

tazz4843 commented May 4, 2023

I didn't go to double-check, just copied from similar functions. It may be a good idea to verify. PRs are welcomed.

@jbg
Copy link

jbg commented May 19, 2023

They're not guaranteed to be valid UTF-8: avstack/gst-whisper#1

Also affects full_get_segment_text().

Rather than leaking the CStr FFI type in the API, maybe change the existing functions that return String to internally go CStr -> &[u8] -> (lossy) -> String so that they don't error out on bad UTF-8, and provide adjacent functions that return the bytes as a Vec<u8> so the caller can access the raw bytes if desired?

@jcsoo
Copy link
Contributor Author

jcsoo commented May 24, 2023

Adding token_to_bytes and full_get_segment_bytes() is easy, and I agree that it would be better not to leak the CStr.

token_to_str is trickier because it currently attempts to return a &str. Doing a lossy conversion would require an allocation and returning a String. On the other hand, full_get_segment_text could be changed to do a lossy conversion without any problems.

Maybe token_to_str should be replaced by token_to_string to support lossy conversion?

@tazz4843
Copy link
Owner

tazz4843 commented Apr 6, 2024

Should be solved in f4ea0d9

@tazz4843 tazz4843 closed this as completed Apr 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants