Add access to segment text as bytes #79

Merged (1 commit) on Aug 18, 2023

Conversation

travolin
Adds an access method needed to work around the problems described in Issue #46.

Change

This change adds an access method to the whisper_state that returns the segment text as a vector of u8.

Use Case

This handles cases where a segment is not a valid UTF-8 string, which makes full_get_segment_text return an error. This can be fairly common for languages whose characters occupy multiple UTF-8 bytes, such as Mandarin. A segment is not guaranteed to end on a UTF-8 character boundary; the remaining bytes of a split character can be found at the start of the next segment. When full_get_segment_text returns an error, the library user can grab the bytes with this new method, buffer them, and append the next segment's bytes until the buffer forms a valid UTF-8 string.
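
The buffering approach described above could look roughly like the following. This is a minimal sketch using only the Rust standard library, not code from this PR: SegmentBuffer and push are hypothetical names, and the byte slices passed to push stand in for whatever the new access method returns for each segment.

```rust
/// Hypothetical helper (not part of whisper-rs): accumulates raw segment
/// bytes and hands back text as soon as a decodable prefix is available.
#[derive(Default)]
struct SegmentBuffer {
    /// Bytes that could not be decoded yet (an incomplete trailing character).
    pending: Vec<u8>,
}

impl SegmentBuffer {
    /// Appends one segment's bytes. Returns the decoded text, or `None` if
    /// nothing can be decoded until more bytes arrive.
    fn push(&mut self, segment: &[u8]) -> Option<String> {
        self.pending.extend_from_slice(segment);

        let valid_len = match std::str::from_utf8(&self.pending) {
            // The whole buffer is valid UTF-8.
            Ok(_) => self.pending.len(),
            // `error_len() == None` means the input ended in the middle of
            // a character: everything before `valid_up_to()` is decodable.
            Err(e) if e.error_len().is_none() => e.valid_up_to(),
            // Genuinely invalid bytes: waiting will not help, so decode
            // lossily (U+FFFD replaces the bad bytes) and start fresh.
            Err(_) => {
                let text = String::from_utf8_lossy(&self.pending).into_owned();
                self.pending.clear();
                return Some(text);
            }
        };

        if valid_len == 0 {
            return None;
        }

        // Keep the incomplete tail for the next call, return the valid head.
        let tail = self.pending.split_off(valid_len);
        let head = std::mem::replace(&mut self.pending, tail);
        Some(String::from_utf8(head).expect("prefix was validated above"))
    }
}
```

A caller would create one SegmentBuffer per transcription, pass each segment's bytes to push in order, and append whatever text comes back. Returning the valid prefix early, rather than holding the whole segment, is a design choice of this sketch so that only the split character is delayed.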

Testing

All testing was done against the following Chinese-language podcast episode, chosen at random: "https://anchor.fm/s/1553388c/podcast/play/60927763/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2022-10-20%2F3b2b6ec4-4399-393e-c056-06e16f5c0c9e.mp3"

Owner

@tazz4843 tazz4843 left a comment

I was wondering what languages would throw this sort of issue, after mentioning it, now that makes sense. LGTM, thanks for the PR!

@tazz4843 tazz4843 merged commit 24e6a00 into tazz4843:master Aug 18, 2023
13 checks passed