Add access to segment text as bytes #79

travolin · 2023-08-18T00:23:18Z

Adds an access method needed to address Issue #46.

Change

This change adds an additional access method to the whisper_state to allow access to the segment text as a vector of u8.

Use Case

This handles cases where the segment is not a valid UTF-8 string, which makes full_get_segment_text return an error. This can be fairly common for languages whose characters take multiple bytes in UTF-8, such as Mandarin. A segment is not guaranteed to end on a UTF-8 character boundary; the remaining bytes of a split character show up at the start of the next segment. When full_get_segment_text returns an error, the library user can grab the bytes with this new method, buffer them, and append the following segments until the buffer forms a valid UTF-8 string.
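The buffering approach described above could look roughly like the sketch below. It is a self-contained illustration only: `SegmentTextBuffer` and its `push` method are hypothetical names that are not part of this library, and the byte slices stand in for whatever the new byte-access method returns for each segment.

```rust
/// Hypothetical helper (not part of the library): accumulates raw segment
/// bytes and hands back text only once the buffer is valid UTF-8.
#[derive(Default)]
struct SegmentTextBuffer {
    pending: Vec<u8>,
}

impl SegmentTextBuffer {
    /// Append one segment's bytes. Returns the decoded text if the buffered
    /// bytes now form valid UTF-8; otherwise keeps them and returns None.
    fn push(&mut self, segment_bytes: &[u8]) -> Option<String> {
        self.pending.extend_from_slice(segment_bytes);
        match std::str::from_utf8(&self.pending) {
            Ok(text) => {
                let text = text.to_owned();
                self.pending.clear();
                Some(text)
            }
            // Most likely a character split across segments: wait for more bytes.
            Err(_) => None,
        }
    }
}

fn main() {
    // "你好" is 6 bytes in UTF-8; splitting after byte 4 cuts the second
    // character in half, mimicking two segments with a bad boundary.
    let bytes = "你好".as_bytes();
    let (first, second) = bytes.split_at(4);

    let mut buffer = SegmentTextBuffer::default();
    assert_eq!(buffer.push(first), None);
    assert_eq!(buffer.push(second), Some("你好".to_string()));
}
```

A real caller would probably also want to cap how many segments it buffers, so that genuinely invalid bytes do not hold back all later text indefinitely.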

Testing

All testing was done against the following randomly chosen Chinese-language podcast: "https://anchor.fm/s/1553388c/podcast/play/60927763/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2022-10-20%2F3b2b6ec4-4399-393e-c056-06e16f5c0c9e.mp3"

tazz4843

I was wondering which languages would trigger this sort of issue; now that you mention it, that makes sense. LGTM, thanks for the PR!

Add access to segment text as bytes

bdfbeb6

tazz4843 approved these changes Aug 18, 2023

tazz4843 merged commit 24e6a00 into tazz4843:master Aug 18, 2023
13 checks passed