
Conversation

@maxdebayser commented Sep 1, 2025

When the truncation max_length is shorter than the number of added special tokens, there is an integer underflow, even when the user didn't ask to add special tokens.

For example, this code:

```rust
use tokenizers::{Tokenizer, TruncationParams};

fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    let max_length = 1;
    let mut tokenizer = Tokenizer::from_pretrained("ibm-granite/granite-embedding-125m-english", None)?;
    let tokenizer = tokenizer.with_truncation(Some(TruncationParams {
        max_length,
        strategy: tokenizers::utils::truncation::TruncationStrategy::LongestFirst,
        direction: tokenizers::utils::truncation::TruncationDirection::Right,
        stride: 0,
    }))?;

    let data = String::from("This is it s simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.");
    let data: Vec<_> = data.lines().collect();
    let add_special_tokens = false;

    let result = tokenizer.encode_batch_char_offsets(data, add_special_tokens)?;
    println!("{:?}", result[0].get_ids());
    Ok(())
}
```

fails with the following error:

```
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.09s
     Running `target/debug/tokenizers-troubleshoot`
max_length=1, n_added_tokens=2

thread 'main' panicked at /home/mdevino/dev/projects/tokenizers-troubleshoot/tokenizers/tokenizers/src/tokenizer/mod.rs:625:40:
attempt to subtract with overflow
stack backtrace:
   0: __rustc::rust_begin_unwind
             at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/std/src/panicking.rs:697:5
   1: core::panicking::panic_fmt
             at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/core/src/panicking.rs:75:14
   2: core::panicking::panic_const::panic_const_sub_overflow
             at /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/core/src/panicking.rs:175:17
   3: tokenizers::tokenizer::TokenizerImpl<M,N,PT,PP,D>::with_truncation
             at /tokenizers/tokenizers/src/tokenizer/mod.rs:625:40
   4: tokenizers_troubleshoot::main
             at ./src/main.rs:6:31
   5: core::ops::function::FnOnce::call_once
             at /toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
```

This error happens with the ibm-granite model because it is a RoBERTa model, which adds 2 special tokens. With Llama this issue does not happen. I found this problem in the context of this vLLM issue: vllm-project/vllm#22635.

The idea of this PR is to move the verification containing the underflow-prone code from tokenizer initialization to the actual encode call, where it can take the value of add_special_tokens into account.
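A minimal sketch of that idea, with illustrative names (`effective_max_length` is a hypothetical helper, not the actual tokenizers internals): the truncation budget is computed at encode time, where the value of `add_special_tokens` is known, and the subtraction is clamped so it cannot underflow:

```rust
// Illustrative sketch only, not the real tokenizers API. The point is
// that the subtraction happens only when special tokens are actually
// added, and it saturates at 0 instead of panicking.
fn effective_max_length(max_length: usize, n_added_tokens: usize, add_special_tokens: bool) -> usize {
    if add_special_tokens {
        // saturating_sub clamps at 0 when n_added_tokens > max_length
        max_length.saturating_sub(n_added_tokens)
    } else {
        // no special tokens are added, so the full budget is available
        max_length
    }
}

fn main() {
    // The failing case from the report: max_length=1, n_added_tokens=2.
    assert_eq!(effective_max_length(1, 2, false), 1);
    assert_eq!(effective_max_length(1, 2, true), 0);
    assert_eq!(effective_max_length(512, 2, true), 510);
    println!("ok");
}
```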

When the truncation max_len is shorter than the number of added
tokens there is an underflow issue even when the user didn't ask
to add special tokens.

Signed-off-by: Max de Bayser <[email protected]>
@Narsil (Collaborator) left a comment


Thanks for the bug report + PR.

I'm not sure the fix you propose is the optimal solution. To understand the problem space better: are you still using added tokens at this point? Wouldn't it be even easier to simply remove the added tokens by dropping the post_processor from the tokenizer if you are not using them?

```diff
@@ -1216,7 +1206,7 @@ where
         if add_special_tokens && n_added_tokens > 0 {
             let params = TruncationParams {
-                max_length: trunc.max_length - n_added_tokens,
+                max_length: if n_added_tokens > trunc.max_length { 0 } else { trunc.max_length - n_added_tokens },
```
Suggested change:

```diff
-                max_length: if n_added_tokens > trunc.max_length { 0 } else { trunc.max_length - n_added_tokens },
+                max_length: trunc.max_length.checked_sub(n_added_tokens).unwrap_or(0),
```

NIT: I feel like this is more readable
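As a standalone illustration (values taken from the failing case above), `checked_sub(...).unwrap_or(0)` yields 0 instead of panicking; `saturating_sub` is the standard-library shorthand for the same pattern:

```rust
fn main() {
    let max_length: usize = 1;
    let n_added_tokens: usize = 2;

    // A plain `max_length - n_added_tokens` panics in debug builds with
    // "attempt to subtract with overflow". `checked_sub` returns None on
    // underflow instead, so we can fall back to 0.
    let effective = max_length.checked_sub(n_added_tokens).unwrap_or(0);
    assert_eq!(effective, 0);

    // Equivalent standard-library shorthand:
    assert_eq!(max_length.saturating_sub(n_added_tokens), 0);

    println!("effective max_length = {effective}");
}
```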

```diff
@@ -506,7 +506,7 @@ impl DerefMut for Tokenizer {

 #[derive(thiserror::Error, Debug)]
 #[error("{0}")]
-pub struct TruncationParamError(String);
+pub struct TruncationParamError(pub String);
```
No pub here, please. Create a constructor if needed.

```diff
@@ -619,16 +619,6 @@ where
     ///
     /// Fails if `stride` is too high relative to `max_length` and `post_processor.added_tokens()`
     pub fn with_truncation(&mut self, trunc: Option<TruncationParams>) -> Result<&mut Self> {
-        if let Some(trunc_params) = &trunc {
```
I don't feel great about modifying this sanitization.
It seems to me that one should be aware of the added tokens, as they are part of the standard tokenization defined by the tokenizer's creator. So preventing chunking below that size is kind of important.

If someone REALLY wants super low chunking and wants to ignore the added tokens, that seems specific enough that simply setting the post_processor to None would be much simpler at that point.

So we can keep this footgun check alive, and power users can still modify the tokenizer's behavior.

```diff
@@ -619,16 +619,6 @@ where
     ///
     /// Fails if `stride` is too high relative to `max_length` and `post_processor.added_tokens()`
     pub fn with_truncation(&mut self, trunc: Option<TruncationParams>) -> Result<&mut Self> {
-        if let Some(trunc_params) = &trunc {
-            let n_added_tokens = self.get_n_added_tokens(false);
-            let effective_max_length = trunc_params.max_length - n_added_tokens;
```

This should probably be `checked_sub` so we don't panic here either.
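One way to keep the check without panicking, sketched with illustrative names (`validate_truncation` and its error messages are hypothetical, not the actual `with_truncation` body): use `checked_sub` and surface a descriptive error instead of subtracting directly:

```rust
// Hypothetical sketch of a non-panicking validation; not the actual
// with_truncation implementation in tokenizers.
fn validate_truncation(max_length: usize, stride: usize, n_added_tokens: usize) -> Result<(), String> {
    // checked_sub returns None instead of panicking on underflow,
    // which we turn into a descriptive error.
    let effective_max_length = max_length.checked_sub(n_added_tokens).ok_or_else(|| {
        format!("max_length ({max_length}) is smaller than the number of added tokens ({n_added_tokens})")
    })?;
    // Mirror the documented constraint: stride must be low enough
    // relative to the effective truncation budget.
    if stride >= effective_max_length {
        return Err(format!(
            "stride ({stride}) must be strictly less than the effective max_length ({effective_max_length})"
        ));
    }
    Ok(())
}

fn main() {
    // The failing case from the report: underflow becomes an Err.
    assert!(validate_truncation(1, 0, 2).is_err());
    // A sane configuration passes.
    assert!(validate_truncation(512, 0, 2).is_ok());
    // A stride too high relative to the effective budget is also rejected.
    assert!(validate_truncation(10, 8, 2).is_err());
    println!("ok");
}
```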
