
Add Byte Pair Encoding (BPE) class for subword tokenization #3056

Open · wants to merge 4 commits into master
Conversation

@Cydral (Contributor) commented Feb 15, 2025

Description:

This PR introduces a new bpe_tokenizer class to Dlib, implementing the Byte Pair Encoding (BPE) algorithm for subword tokenization. BPE is a widely used technique in natural language processing (NLP) for handling out-of-vocabulary words and reducing vocabulary size while preserving the ability to represent arbitrary text.

Key Features:

  • BPE Algorithm: Implements the BPE algorithm as described in Sennrich et al., 2016.
  • Special Tokens: Supports predefined special tokens (e.g., <text>, <url>, <image>) for marking specific elements in the text.
  • Training and Encoding: Provides methods for training the tokenizer on a text corpus and encoding/decoding text into subword tokens.
  • Serialization: Supports saving and loading the tokenizer model and vocabulary for reuse.
  • Multi-threaded Training: Uses multi-threading to compute pair-frequency statistics efficiently during training.

Usage:

dlib::bpe_tokenizer tokenizer;
tokenizer.train(corpus_text, target_vocab_size, true); // Train on a text corpus
std::vector<int> tokens = tokenizer.encode("Sample text to tokenize."); // Encode text
std::string decoded_text = tokenizer.decode(tokens); // Decode tokens back to text
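
The serialization support mentioned under Key Features can be sketched as a continuation of the usage above. This assumes the free-function serialize/deserialize overloads for bpe_tokenizer added in this PR; it is illustrative, not the final API:

```cpp
#include <sstream>

// Save the trained tokenizer to memory and reload it (assumes the
// serialize()/deserialize() overloads this PR adds for bpe_tokenizer).
std::ostringstream sout;
serialize(tokenizer, sout);           // write the trained model

dlib::bpe_tokenizer restored;
std::istringstream sin(sout.str());
deserialize(restored, sin);           // reload it without touching disk
```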

- Implement BPE (Byte Pair Encoding) tokenization
- Add training and encoding methods
- Include unit tests
@Cydral Cydral changed the title Add Byte Pair Encoding Class for Subword Tokenization Add Byte Pair Encoding (BPE) class for subword tokenization Feb 15, 2025
Comment on lines +380 to +387
std::ofstream out_file("bpe_tokenizer_model.dat", std::ios::binary);
serialize(test, out_file);
out_file.close();

bpe_tok loaded_test;
std::ifstream in_file("bpe_tokenizer_model.dat", std::ios::binary);
deserialize(loaded_test, in_file);
in_file.close();
@davisking (Owner) commented Feb 27, 2025

This is a good thing to test. But use std::ostringstream and std::istringstream so the test doesn't end up leaving files around; with stringstreams it's all just in memory. And you don't need to mess with .close(), so it's simpler too.

Comment on lines +400 to +403
std::cout << "Original: " << text << "\n";
std::cout << "Encoded: ";
for (int id : encoded) std::cout << id << " ";
std::cout << "\nDecoded: " << decoded << "\n----------------------------------------\n";
@davisking (Owner)

Do DLIB_TEST(text == decoded) here instead, right? Don't cout anything. The test needs to DLIB_TEST() something for it to actually check anything.

* This limit can be adjusted by modifying the `MAX_TOKEN_LENGTH` constant.
*
*/
class bpe_tokenizer
@davisking (Owner)

This is cool. Add a bpe_tokenizer_abstract.h file and put the docs in there so it's like all the other docs in dlib. I'll eventually link that into the dlib.net web page and docs and whatnot too. And use the same comment/doc style as the other parts of the library (outlined in https://dlib.net/intro.html#notation but there are tons of examples in the library)
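
For reference, dlib's abstract headers document each member with /*! ... !*/ blocks ("WHAT THIS OBJECT REPRESENTS", requires/ensures clauses) and are guarded so they are never compiled. A hypothetical fragment of what bpe_tokenizer_abstract.h might look like, with the method name taken from the usage example above and the clause wording purely illustrative:

```cpp
// dlib/tokenizer/bpe_tokenizer_abstract.h  (illustrative sketch)
#undef DLIB_BPE_TOKENIZER_ABSTRACT_H_
#ifdef DLIB_BPE_TOKENIZER_ABSTRACT_H_  // abstract headers are never compiled

namespace dlib
{
    class bpe_tokenizer
    {
        /*!
            WHAT THIS OBJECT REPRESENTS
                This object implements the Byte Pair Encoding (BPE) subword
                tokenization algorithm of Sennrich et al., 2016.
        !*/

    public:

        std::vector<int> encode (
            const std::string& text
        );
        /*!
            ensures
                - returns the sequence of subword token ids for text.
        !*/
    };
}

#endif // DLIB_BPE_TOKENIZER_ABSTRACT_H_
```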
