
Conversation

@maxdebayser commented Sep 1, 2025

When the truncation max_length is shorter than the number of added special tokens, there is an integer underflow, even when the user didn't ask to add special tokens.

For example, this code:

```rust
use tokenizers::{Tokenizer, TruncationParams};

fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    let max_length = 1;
    let mut tokenizer = Tokenizer::from_pretrained("ibm-granite/granite-embedding-125m-english", None)?;
    let tokenizer = tokenizer.with_truncation(Some(TruncationParams {
        max_length,
        strategy: tokenizers::utils::truncation::TruncationStrategy::LongestFirst,
        direction: tokenizers::utils::truncation::TruncationDirection::Right,
        stride: 0,
    }))?;

    let data = String::from("This is it s simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.");
    let data: Vec<_> = data.lines().collect();
    let add_special_tokens = false;

    let result = tokenizer.encode_batch_char_offsets(data, add_special_tokens)?;
    println!("{:?}", result[0].get_ids());
    Ok(())
}
```

fails with the following error:

```
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.09s
     Running `target/debug/tokenizers-troubleshoot`
max_length=1, n_added_tokens=2

thread 'main' panicked at /home/mdevino/dev/projects/tokenizers-troubleshoot/tokenizers/tokenizers/src/tokenizer/mod.rs:625:40:
attempt to subtract with overflow
stack backtrace:
   0: __rustc::rust_begin_unwind
             at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/panicking.rs:697:5
   1: core::panicking::panic_fmt
             at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/core/src/panicking.rs:75:14
   2: core::panicking::panic_const::panic_const_sub_overflow
             at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/core/src/panicking.rs:175:17
   3: tokenizers::tokenizer::TokenizerImpl<M,N,PT,PP,D>::with_truncation
             at /tokenizers/tokenizers/src/tokenizer/mod.rs:625:40
   4: tokenizers_troubleshoot::main
             at ./src/main.rs:6:31
   5: core::ops::function::FnOnce::call_once
             at /toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
```

This error happens with the ibm-granite model because it is a RoBERTa model, which adds 2 special tokens. With Llama this issue does not happen. I found this problem in the context of this vLLM issue: vllm-project/vllm#22635.

The idea of this PR is to move the verification containing the underflow-prone code from tokenizer initialization to the actual encode call, where it can take the value of add_special_tokens into account.
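A minimal sketch of that idea, with illustrative names (`effective_max_length` is a hypothetical helper, not the actual tokenizers internals): the truncation budget is computed at encode time, where the value of `add_special_tokens` is known, and the subtraction is clamped so it cannot underflow:

```rust
// Illustrative sketch only, not the real tokenizers API. The point is
// that the subtraction happens only when special tokens are actually
// added, and it saturates at 0 instead of panicking.
fn effective_max_length(max_length: usize, n_added_tokens: usize, add_special_tokens: bool) -> usize {
    if add_special_tokens {
        // saturating_sub clamps at 0 when n_added_tokens > max_length
        max_length.saturating_sub(n_added_tokens)
    } else {
        // no special tokens are added, so the full budget is available
        max_length
    }
}

fn main() {
    // The failing case from the report: max_length=1, n_added_tokens=2.
    assert_eq!(effective_max_length(1, 2, false), 1);
    assert_eq!(effective_max_length(1, 2, true), 0);
    assert_eq!(effective_max_length(512, 2, true), 510);
    println!("ok");
}
```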

When the truncation max_len is shorter than the number of added
tokens there is an underflow issue even when the user didn't ask
to add special tokens.

Signed-off-by: Max de Bayser <[email protected]>
@Narsil (Collaborator) left a comment


Thanks for the bug report + PR.

I'm not sure the fix you propose is the optimal solution. To understand the problem space better: are you still using added tokens at this point? Wouldn't it be even easier to simply remove the added tokens by dropping the post_processor from the tokenizer if you are not using them?

```diff
@@ -1216,7 +1206,7 @@ where
         if add_special_tokens && n_added_tokens > 0 {
             let params = TruncationParams {
-                max_length: trunc.max_length - n_added_tokens,
+                max_length: if n_added_tokens > trunc.max_length { 0 } else { trunc.max_length - n_added_tokens },
```
Suggested change:

```diff
-                max_length: if n_added_tokens > trunc.max_length { 0 } else { trunc.max_length - n_added_tokens },
+                max_length: trunc.max_length.checked_sub(n_added_tokens).unwrap_or(0),
```

NIT: I feel like this is more readable
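As a standalone illustration (values taken from the failing case above), `checked_sub(...).unwrap_or(0)` yields 0 instead of panicking; `saturating_sub` is the standard-library shorthand for the same pattern:

```rust
fn main() {
    let max_length: usize = 1;
    let n_added_tokens: usize = 2;

    // A plain `max_length - n_added_tokens` panics in debug builds with
    // "attempt to subtract with overflow". `checked_sub` returns None on
    // underflow instead, so we can fall back to 0.
    let effective = max_length.checked_sub(n_added_tokens).unwrap_or(0);
    assert_eq!(effective, 0);

    // Equivalent standard-library shorthand:
    assert_eq!(max_length.saturating_sub(n_added_tokens), 0);

    println!("effective max_length = {effective}");
}
```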

```diff
@@ -506,7 +506,7 @@ impl DerefMut for Tokenizer {

 #[derive(thiserror::Error, Debug)]
 #[error("{0}")]
-pub struct TruncationParamError(String);
+pub struct TruncationParamError(pub String);
```
No pub here, please. Create a constructor if needed.

```diff
@@ -619,16 +619,6 @@ where
     ///
     /// Fails if `stride` is too high relative to `max_length` and `post_processor.added_tokens()`
     pub fn with_truncation(&mut self, trunc: Option<TruncationParams>) -> Result<&mut Self> {
-        if let Some(trunc_params) = &trunc {
```
I don't feel great about modifying this sanitization.
It seems to me that one should be aware of the added tokens, as they are part of the standard tokenization defined by the tokenizer's creator. So preventing chunking below that size is kind of important.

If someone REALLY wants super low chunking and wants to ignore the added tokens, that seems specific enough that simply setting the post_processor to None would be much simpler at that point.

So we can keep this footgun check alive, and power users can still modify the tokenizer's behavior.

```diff
@@ -619,16 +619,6 @@ where
     ///
     /// Fails if `stride` is too high relative to `max_length` and `post_processor.added_tokens()`
     pub fn with_truncation(&mut self, trunc: Option<TruncationParams>) -> Result<&mut Self> {
-        if let Some(trunc_params) = &trunc {
-            let n_added_tokens = self.get_n_added_tokens(false);
-            let effective_max_length = trunc_params.max_length - n_added_tokens;
```

This should probably be `checked_sub` so we don't panic here either.
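One way to keep the check without panicking, sketched with illustrative names (`validate_truncation` and its error messages are hypothetical, not the actual `with_truncation` body): use `checked_sub` and surface a descriptive error instead of subtracting directly:

```rust
// Hypothetical sketch of a non-panicking validation; not the actual
// with_truncation implementation in tokenizers.
fn validate_truncation(max_length: usize, stride: usize, n_added_tokens: usize) -> Result<(), String> {
    // checked_sub returns None instead of panicking on underflow,
    // which we turn into a descriptive error.
    let effective_max_length = max_length.checked_sub(n_added_tokens).ok_or_else(|| {
        format!("max_length ({max_length}) is smaller than the number of added tokens ({n_added_tokens})")
    })?;
    // Mirror the documented constraint: stride must be low enough
    // relative to the effective truncation budget.
    if stride >= effective_max_length {
        return Err(format!(
            "stride ({stride}) must be strictly less than the effective max_length ({effective_max_length})"
        ));
    }
    Ok(())
}

fn main() {
    // The failing case from the report: underflow becomes an Err.
    assert!(validate_truncation(1, 0, 2).is_err());
    // A sane configuration passes.
    assert!(validate_truncation(512, 0, 2).is_ok());
    // A stride too high relative to the effective budget is also rejected.
    assert!(validate_truncation(10, 8, 2).is_err());
    println!("ok");
}
```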
