
Add Byte Pair Encoding (BPE) class for subword tokenization #3056

Open · wants to merge 4 commits into master
Conversation

@Cydral (Contributor) commented Feb 15, 2025

Description:

This PR introduces a new bpe_tokenizer class to Dlib, implementing the Byte Pair Encoding (BPE) algorithm for subword tokenization. BPE is a widely used technique in natural language processing (NLP) for handling out-of-vocabulary words and reducing vocabulary size while preserving the ability to represent arbitrary text.

Key Features:

  • BPE Algorithm: Implements the BPE algorithm as described in Sennrich et al., 2016.
  • Special Tokens: Supports predefined special tokens (e.g., <text>, <url>, <image>) for marking specific elements in the text.
  • Training and Encoding: Provides methods for training the tokenizer on a text corpus and encoding/decoding text into subword tokens.
  • Serialization: Supports saving and loading the tokenizer model and vocabulary for reuse.
  • Multi-threaded Training: Uses multi-threading to compute pair-frequency statistics efficiently during training.

Usage:

dlib::bpe_tokenizer tokenizer;
tokenizer.train(corpus_text, target_vocab_size, true); // Train on a text corpus
std::vector<int> tokens = tokenizer.encode("Sample text to tokenize."); // Encode text
std::string decoded_text = tokenizer.decode(tokens); // Decode tokens back to text
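
The serialization support mentioned under Key Features can be sketched as a continuation of the usage above. This assumes the free-function serialize/deserialize overloads for bpe_tokenizer added in this PR; it is illustrative, not the final API:

```cpp
#include <sstream>

// Save the trained tokenizer to memory and reload it (assumes the
// serialize()/deserialize() overloads this PR adds for bpe_tokenizer).
std::ostringstream sout;
serialize(tokenizer, sout);           // write the trained model

dlib::bpe_tokenizer restored;
std::istringstream sin(sout.str());
deserialize(restored, sin);           // reload it without touching disk
```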

- Implement BPE (Byte Pair Encoding) tokenization
- Add training and encoding methods
- Include unit tests
@Cydral Cydral changed the title Add Byte Pair Encoding Class for Subword Tokenization Add Byte Pair Encoding (BPE) class for subword tokenization Feb 15, 2025
Comment on lines +380 to +387
std::ofstream out_file("bpe_tokenizer_model.dat", std::ios::binary);
serialize(test, out_file);
out_file.close();

bpe_tok loaded_test;
std::ifstream in_file("bpe_tokenizer_model.dat", std::ios::binary);
deserialize(loaded_test, in_file);
in_file.close();
@davisking (Owner) commented Feb 27, 2025

This is a good thing to test. But use std::ostringstream and std::istringstream so the test doesn't end up leaving files around; with stringstreams it's all just in memory. And you don't need to mess with .close(), so it's simpler too.

Comment on lines +400 to +403
std::cout << "Original: " << text << "\n";
std::cout << "Encoded: ";
for (int id : encoded) std::cout << id << " ";
std::cout << "\nDecoded: " << decoded << "\n----------------------------------------\n";
@davisking (Owner)

Do DLIB_TEST(text == decoded) here instead, right? Don't cout anything. The test needs to DLIB_TEST() something for it to actually check anything.

* This limit can be adjusted by modifying the `MAX_TOKEN_LENGTH` constant.
*
*/
class bpe_tokenizer
@davisking (Owner)

This is cool. Add a bpe_tokenizer_abstract.h file and put the docs in there so it's like all the other docs in dlib. I'll eventually link that into the dlib.net web page and docs and whatnot too. And use the same comment/doc style as the other parts of the library (outlined in https://dlib.net/intro.html#notation but there are tons of examples in the library)
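
For reference, dlib's abstract headers document each member with /*! ... !*/ blocks ("WHAT THIS OBJECT REPRESENTS", requires/ensures clauses) and are guarded so they are never compiled. A hypothetical fragment of what bpe_tokenizer_abstract.h might look like, with the method name taken from the usage example above and the clause wording purely illustrative:

```cpp
// dlib/tokenizer/bpe_tokenizer_abstract.h  (illustrative sketch)
#undef DLIB_BPE_TOKENIZER_ABSTRACT_H_
#ifdef DLIB_BPE_TOKENIZER_ABSTRACT_H_  // abstract headers are never compiled

namespace dlib
{
    class bpe_tokenizer
    {
        /*!
            WHAT THIS OBJECT REPRESENTS
                This object implements the Byte Pair Encoding (BPE) subword
                tokenization algorithm of Sennrich et al., 2016.
        !*/

    public:

        std::vector<int> encode (
            const std::string& text
        );
        /*!
            ensures
                - returns the sequence of subword token ids for text.
        !*/
    };
}

#endif // DLIB_BPE_TOKENIZER_ABSTRACT_H_
```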
