Add Byte Pair Encoding (BPE) class for subword tokenization #3056
base: master
std::ofstream out_file("bpe_tokenizer_model.dat", std::ios::binary);
serialize(test, out_file);
out_file.close();

bpe_tok loaded_test;
std::ifstream in_file("bpe_tokenizer_model.dat", std::ios::binary);
deserialize(loaded_test, in_file);
in_file.close();
This is a good thing to test. But use `std::ostringstream` and `std::istringstream` so the test doesn't end up leaving files around. With a stringstream it's all just in memory. And you don't need to mess with `.close()`, so it's simpler too.
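A minimal sketch of the in-memory round trip suggested above. `toy_model` and its `serialize`/`deserialize` overloads are hypothetical stand-ins so the snippet compiles without dlib; they are not the PR's `bpe_tokenizer` and not dlib's wire format. Only the stream handling is the point.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the tokenizer under test.
struct toy_model
{
    std::vector<std::int32_t> merges;
};

// Stand-in for a serialize() overload: writes a count, then the raw values.
void serialize(const toy_model& item, std::ostream& out)
{
    const std::uint64_t n = item.merges.size();
    out.write(reinterpret_cast<const char*>(&n), sizeof(n));
    if (n != 0)
        out.write(reinterpret_cast<const char*>(item.merges.data()),
                  static_cast<std::streamsize>(n * sizeof(std::int32_t)));
}

// Stand-in for the matching deserialize() overload.
void deserialize(toy_model& item, std::istream& in)
{
    std::uint64_t n = 0;
    in.read(reinterpret_cast<char*>(&n), sizeof(n));
    item.merges.resize(static_cast<std::size_t>(n));
    if (n != 0)
        in.read(reinterpret_cast<char*>(item.merges.data()),
                static_cast<std::streamsize>(n * sizeof(std::int32_t)));
}

// Round-trips a model through an in-memory buffer: no file is created
// and there is nothing to close.
toy_model round_trip_in_memory(const toy_model& original)
{
    std::ostringstream sout;
    serialize(original, sout);

    std::istringstream sin(sout.str());
    toy_model loaded;
    deserialize(loaded, sin);
    return loaded;
}
```

In the PR's test, the same pattern would presumably wrap the existing `serialize(test, ...)` and `deserialize(loaded_test, ...)` calls, with the two stringstreams replacing `out_file` and `in_file`.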
std::cout << "Original: " << text << "\n";
std::cout << "Encoded: ";
for (int id : encoded) std::cout << id << " ";
std::cout << "\nDecoded: " << decoded << "\n----------------------------------------\n";
Do `DLIB_TEST(text == decoded)`, right? Don't `cout` anything. You need to `DLIB_TEST()` something for the test to actually check anything.
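A rough sketch of the check being asked for: compare the decoded text against the original instead of printing both. `toy_encode` and `toy_decode` are hypothetical byte-level stand-ins, not the PR's tokenizer, and the result is returned as a `bool` here so the snippet builds without dlib's test harness.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for encode(): one id per input byte.
std::vector<int> toy_encode(const std::string& text)
{
    std::vector<int> ids;
    ids.reserve(text.size());
    for (unsigned char c : text) ids.push_back(c);
    return ids;
}

// Hypothetical stand-in for decode(): maps each id back to its byte.
std::string toy_decode(const std::vector<int>& ids)
{
    std::string text;
    text.reserve(ids.size());
    for (int id : ids)
        text.push_back(static_cast<char>(static_cast<unsigned char>(id)));
    return text;
}

// True when the text survives encode -> decode unchanged.
// This is the condition the test should assert on rather than print.
bool round_trips(const std::string& text)
{
    const std::vector<int> encoded = toy_encode(text);
    const std::string decoded = toy_decode(encoded);
    return text == decoded;
}
```

Inside a dlib test the condition would go straight into the macro, as in the `DLIB_TEST(text == decoded)` form suggested above, so a mismatch fails the test instead of scrolling past in the output.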
* This limit can be adjusted by modifying the `MAX_TOKEN_LENGTH` constant.
*
*/
class bpe_tokenizer
This is cool. Add a `bpe_tokenizer_abstract.h` file and put the docs in there so it's like all the other docs in dlib. I'll eventually link that into the dlib.net web page and docs and whatnot too. And use the same comment/doc style as the other parts of the library (outlined in https://dlib.net/intro.html#notation, but there are tons of examples in the library).
Description:
This PR introduces a new `bpe_tokenizer` class to Dlib, implementing the Byte Pair Encoding (BPE) algorithm for subword tokenization. BPE is a widely used technique in natural language processing (NLP) for handling out-of-vocabulary words and reducing vocabulary size while maintaining text representation capabilities.
Key Features:
`<text>`, `<url>`, `<image>`) for marking specific elements in the text.
Usage: