Cannot tokenize byte sequences that are not valid UTF-8 due to design flaw #388
BPE can operate on arbitrary byte sequences, but the regex splitting is in Unicode space. There is a private `_encode_bytes` you could use. (Also as a general note: while there are many possible tokenisations, models only see a subset of possible tokenisations at training time, so if you tokenise something in a way that the model has never seen before, you will typically get degraded performance that is quite hard to debug.)
I greatly appreciate the pointer! It seems like the solution ought to be taking one of the more obvious generalizations of the pre-segmentation regex and shoving all the incorrect bytes into a character class of your choice (alphabetical?). But it would certainly be quite the lift.
Ok, well, `_encode_bytes` returns incorrect results for literally the first byte sequence I put into it, so I can file a separate issue or something. In particular, if you put in `b"\x80" * 6`, you get back `[]`. And obviously if you decode `[]`, you get 0 bytes back, not 6.
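A minimal reproduction of the reported behaviour might look like this (a sketch assuming the private `_encode_bytes` helper behaves as described above; the choice of `cl100k_base` is just illustrative):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

data = b"\x80" * 6  # six bytes that are not valid UTF-8 on their own

# Before the fix, the private helper reportedly returned no tokens at all:
tokens = enc._encode_bytes(data)  # [] pre-fix

# Decoding [] yields b"", so the round trip loses all six input bytes:
assert enc.decode_bytes(tokens) == data  # fails pre-fix, passes post-fix
```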
Hm yeah, the code looks a little wrong (its only current use is as an internal helper in one of tiktoken's internal-only tests). Let me fix.
#389 should fix this and includes a basic property test. (Even if things now round-trip correctly, the word of caution about putting models out of distribution still applies!)
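For readers curious what such a round-trip property test could look like, here is a sketch using Hypothesis (my own illustration, not the actual test added in #389):

```python
import tiktoken
from hypothesis import given, strategies as st

enc = tiktoken.get_encoding("cl100k_base")

@given(st.binary())
def test_encode_bytes_round_trips(data: bytes) -> None:
    # Property: encoding arbitrary bytes and decoding back must be lossless.
    tokens = enc._encode_bytes(data)
    assert enc.decode_bytes(tokens) == data
```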
amazing! thank you! get some sleep :) |
Thanks for the issue! The other thing I'll mention (mostly in case someone later stumbles upon this issue) is that if your use case only involves single tokens, see the method at line 242 in 4560a88.
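The embedded permalink's content is elided here, but tiktoken's single-token methods (`encode_single_token` / `decode_single_token_bytes`) do accept and return raw bytes, which is presumably what is being pointed at. A quick sketch:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# The single-token APIs work in byte space, so even a byte that is not
# valid UTF-8 on its own can be mapped to its token id and back:
token = enc.encode_single_token(b"\x80")
assert enc.decode_single_token_bytes(token) == b"\x80"
```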
Hello,
The BPE algorithm is capable of tokenizing any byte sequence, and LLMs generally accept any sequence of tokens and use token dictionaries that can represent any byte sequence. However, the `encode` method in tiktoken accepts a type that has to be valid UTF-8, so there are many byte sequences, some only one byte long, that you cannot tokenize with this library.
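A minimal illustration of the limitation (the byte value is just an example):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

raw = b"\x80"  # a single byte that is not valid UTF-8 on its own

# The public encode() API takes a str, so the bytes must first be decoded
# as UTF-8 -- which is exactly what fails for sequences like this one:
try:
    enc.encode(raw.decode("utf-8"))
except UnicodeDecodeError as err:
    print("cannot tokenize:", err)
```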