Conversation

sanderland
Contributor

This PR adds a Unicode normalizer to the tokenizers library, enabling filtering of unused and private use code points based on Unicode properties. These characters are often artifacts from editing text with proprietary programs and are not useful for NLP tasks. Filtering them out can improve tokenizer quality.

The implementation covers the Rust core, Python bindings, and Node.js bindings, along with corresponding tests.
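The kind of filtering described above can be sketched with Python's standard `unicodedata` module (this is an illustrative sketch, not the PR's actual Rust implementation or the tokenizers API): code points whose Unicode general category is unassigned (`Cn`), private use (`Co`), or surrogate (`Cs`) are dropped.

```python
import unicodedata

def filter_unwanted(text: str) -> str:
    """Remove unassigned, private-use, and surrogate code points."""
    # "Cn" = unassigned, "Co" = private use, "Cs" = surrogate.
    unwanted = {"Cn", "Co", "Cs"}
    return "".join(ch for ch in text if unicodedata.category(ch) not in unwanted)

# U+E000 sits in the Basic Multilingual Plane's Private Use Area,
# so it is stripped while ordinary letters pass through untouched.
print(filter_unwanted("hello\ue000world"))  # -> "helloworld"
```

A normalizer in the tokenizers pipeline would apply the same per-character predicate during pre-processing, before any tokenization model sees the text.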

@Narsil
Collaborator

Narsil commented Sep 4, 2025

Thanks for the great PR!

You created 3 booleans. Shouldn't we use a flag array instead, so users can filter in/out any particular categories?

Also, is there no way to reuse any of the pre-existing dependencies?

https://en.wikipedia.org/wiki/UTF-8#Surrogates
https://en.wikipedia.org/wiki/Private_Use_Areas
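The flag-array idea suggested here could look like the following sketch, where the caller passes an arbitrary set of Unicode general categories to strip rather than toggling fixed booleans (the function name and signature are hypothetical, not part of the tokenizers API):

```python
import unicodedata

def remove_categories(text: str, categories: set[str]) -> str:
    """Strip every code point whose Unicode general category is in `categories`.

    Generalizes a fixed set of booleans: any combination of the Unicode
    general categories (e.g. "Co", "Cn", "Cs", "Cc") can be selected.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) not in categories)

# Strip private-use and unassigned code points, but keep everything else.
print(remove_categories("a\ue000b", {"Co", "Cn"}))  # -> "ab"
```

One advantage of the category-set form is forward compatibility: new filtering needs map to passing a different set, rather than adding another boolean to the constructor.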
