Are Whisper tokens actually guaranteed to be valid UTF-8 strings? #46
Comments
I didn't go back to double-check, I just copied from similar functions. It may be a good idea to verify. PRs are welcome.
They're not guaranteed to be valid UTF-8: avstack/gst-whisper#1 Also affects Rather than leaking the |
Adding
Maybe |
Should be solved in f4ea0d9
`WhisperContext::token_to_str` currently calls `CStr::to_str`, which will return an error if the contents of the CStr are not valid UTF-8. Is there any guarantee that individual Whisper tokens are actually UTF-8?

If not, it might be helpful to provide a variant of this function that would return the CStr so that the caller could decide what to do with the token (`token_to_str_raw()`? `token_to_cstr()`?).

Whisper could possibly generate a series of tokens that might individually be invalid UTF-8 but could be concatenated to produce a valid String. And, in cases where the resulting string is still not valid UTF-8, the caller may want to decide whether to fail or to use `to_string_lossy()`.