[Bug] Tokenizer.encode: prepend_id scope issue causes NameError when prepend is string #348

@ai-nurmamat

Bug

In Tokenizer.encode(), prepend_id is assigned inside the outer if prepend is not None: block but read inside the nested if isinstance(text, str): block. This creates a fragile scope dependency: prepend_id may be referenced before being defined if the code structure changes. More critically, when prepend is provided as a string (e.g. the BOS token string), encode_single_token() is called, which raises a KeyError if the string is not a single token in the vocabulary.

Version

  • autoresearch: latest (main branch)
  • Python: 3.10+
  • Dependencies: tiktoken>=0.11.0

Root Cause Analysis

Location: prepare.py, Tokenizer.encode() method

def encode(self, text, prepend=None, num_threads=8):
    if prepend is not None:
        prepend_id = prepend if isinstance(prepend, int) else self.enc.encode_single_token(prepend)
    # prepend_id NOT guaranteed to be defined here!
    if isinstance(text, str):
        ids = self.enc.encode_ordinary(text)
        if prepend is not None:
            ids.insert(0, prepend_id)  # safe only while this guard mirrors the outer one; NameError if they ever diverge

The problem:

  1. When prepend is a string, encode_single_token(prepend) is called.
  2. encode_single_token() raises KeyError if the string is not exactly one token in the vocabulary.
  3. The exception propagates out of encode(), so multi-token prepend strings cannot be used at all.
  4. Even for single tokens, the code structure is fragile: the inner guard if prepend is not None: tests the original prepend argument, not whether prepend_id was actually assigned.
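
The fragility in point 4 can be illustrated without tiktoken. The sketch below is not code from prepare.py: encode_drifted and encode_safe are hypothetical names, and ord() stands in for real tokenization. It shows what happens if a later refactor narrows the outer guard while the inner guard keeps testing prepend.

```python
def encode_drifted(text, prepend=None):
    """Hypothetical refactor in which the two guards no longer match."""
    if isinstance(prepend, int):       # outer guard was narrowed
        prepend_id = prepend
    ids = [ord(c) for c in text]       # stand-in for real tokenization
    if prepend is not None:            # inner guard still tests `prepend`
        ids.insert(0, prepend_id)      # unbound when prepend is a str
    return ids


def encode_safe(text, prepend=None):
    """Same logic, but prepend_id is always bound and is what gets tested."""
    prepend_id = prepend if isinstance(prepend, int) else None
    ids = [ord(c) for c in text]
    if prepend_id is not None:
        ids.insert(0, prepend_id)
    return ids


if __name__ == "__main__":
    print(encode_drifted("a", prepend=7))
    try:
        encode_drifted("a", prepend="<bos>")
    except NameError as exc:           # UnboundLocalError subclasses NameError
        print(type(exc).__name__, exc)
    print(encode_safe("a", prepend="<bos>"))
```

Initialising prepend_id up front and guarding on prepend_id itself removes the dependency between the two blocks, which is the same idea the proposed fix uses.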

Reproduction Steps

  1. Create a tokenizer with BOS token <|reserved_0|>
  2. Call tokenizer.encode("hello", prepend="<|reserved_0|>")
  3. If <|reserved_0|> is not a single token in the BPE vocabulary, encode_single_token() raises KeyError
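
The steps above can be approximated offline with a stand-in encoder. FakeEncoding below is a hypothetical class, not tiktoken.Encoding; it assumes tiktoken's documented behaviour that encode_single_token() raises KeyError for a string that is not in the vocabulary, and it tokenizes one id per character purely for illustration. The encode() body reproduces only the str branch quoted in this issue.

```python
class FakeEncoding:
    """Hypothetical stand-in for tiktoken.Encoding (illustration only)."""

    def __init__(self, special_tokens):
        self.special_tokens = dict(special_tokens)

    def encode_single_token(self, text):
        # Assumed to mirror tiktoken: KeyError for strings outside the vocabulary.
        return self.special_tokens[text]

    def encode_ordinary(self, text):
        # Toy tokenization: one id per character.
        return [ord(c) for c in text]


class Tokenizer:
    def __init__(self, enc):
        self.enc = enc

    def encode(self, text, prepend=None):
        # str branch of the current implementation, as quoted in this issue
        if prepend is not None:
            prepend_id = prepend if isinstance(prepend, int) else self.enc.encode_single_token(prepend)
        if isinstance(text, str):
            ids = self.enc.encode_ordinary(text)
            if prepend is not None:
                ids.insert(0, prepend_id)
            return ids
        raise ValueError(f"Invalid input type: {type(text)}")


tok = Tokenizer(FakeEncoding({"<|reserved_0|>": 256}))
print(tok.encode("hi", prepend="<|reserved_0|>"))   # registered token: works

try:
    tok.encode("hi", prepend="<|not_registered|>")  # unknown string
except KeyError as exc:
    print("KeyError:", exc)
```

With a real tiktoken encoding the ids would differ, but the failure mode should be the same: the exception escapes encode() before any text is tokenized.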

Proposed Fix

def encode(self, text, prepend=None, num_threads=8):
    prepend_id = None
    if prepend is not None:
        if isinstance(prepend, int):
            prepend_id = prepend
        else:
            try:
                prepend_id = self.enc.encode_single_token(prepend)
            except KeyError:  # tiktoken raises KeyError for strings that are not a single token
                # Fall back: encode as ordinary text and take first token
                fallback = self.enc.encode_ordinary(prepend)
                prepend_id = fallback[0] if fallback else None

    if isinstance(text, str):
        ids = self.enc.encode_ordinary(text)
        if prepend_id is not None:
            ids.insert(0, prepend_id)
    elif isinstance(text, list):
        ids = self.enc.encode_ordinary_batch(text, num_threads=num_threads)
        if prepend_id is not None:
            for row in ids:
                row.insert(0, prepend_id)
    else:
        raise ValueError(f"Invalid input type: {type(text)}")
    return ids
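
One design point worth discussing: the fallback above keeps only the first token of a multi-token prepend string, so the prefix that ends up in the output may not correspond to what the caller passed. A possible alternative, sketched below under the same assumptions as the rest of this issue, is to resolve prepend to a list of ids and prepend all of them. resolve_prepend_ids, the free-standing encode(), and FakeEncoding are hypothetical names used for illustration; FakeEncoding stands in for tiktoken.Encoding and is assumed to raise KeyError from encode_single_token() the way tiktoken documents.

```python
class FakeEncoding:
    """Hypothetical stand-in for tiktoken.Encoding (illustration only)."""

    def __init__(self, special_tokens):
        self.special_tokens = dict(special_tokens)

    def encode_single_token(self, text):
        return self.special_tokens[text]       # KeyError if not registered

    def encode_ordinary(self, text):
        return [ord(c) for c in text]          # toy: one id per character


def resolve_prepend_ids(enc, prepend):
    """Return the list of ids to prepend; empty when prepend is None."""
    if prepend is None:
        return []
    if isinstance(prepend, int):
        return [prepend]
    try:
        return [enc.encode_single_token(prepend)]
    except KeyError:
        # Keep every token of the string instead of only the first one.
        return enc.encode_ordinary(prepend)


def encode(enc, text, prepend=None):
    prefix = resolve_prepend_ids(enc, prepend)
    if isinstance(text, str):
        return prefix + enc.encode_ordinary(text)
    if isinstance(text, list):
        # Sequential for clarity; the real method uses encode_ordinary_batch.
        return [prefix + enc.encode_ordinary(t) for t in text]
    raise ValueError(f"Invalid input type: {type(text)}")


enc = FakeEncoding({"<|reserved_0|>": 256})
print(encode(enc, "hi", prepend="<|reserved_0|>"))
print(encode(enc, "hi", prepend="ab"))
print(encode(enc, ["a", "b"], prepend=1))
```

Whether a multi-token prefix is desirable depends on how callers use prepend. If it is only ever meant to be a BOS token, raising a clear error for non-single-token strings may be the safer choice.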

Environment

  • Python: 3.10+
  • autoresearch: latest
  • tiktoken: 0.11.0+
