Bug
In Tokenizer.encode(), when prepend is provided as a string (e.g. the BOS token string), prepend_id is assigned inside the outer if prepend is not None: block but read later inside the separate if isinstance(text, str): block. Both sites are guarded by the same condition, so the current code works, but the dependency is fragile: if either guard changes, prepend_id can be read before it is assigned. More critically, a string prepend is passed to encode_single_token(), which raises an error (KeyError, per tiktoken's docstring) if the string is not a valid single token.
Version
- autoresearch: latest (main branch)
- Python: 3.10+
- Dependencies: tiktoken>=0.11.0
Root Cause Analysis
Location: prepare.py — Tokenizer.encode() method
def encode(self, text, prepend=None, num_threads=8):
    if prepend is not None:
        # A string prepend must map to exactly one token, otherwise this call raises.
        prepend_id = prepend if isinstance(prepend, int) else self.enc.encode_single_token(prepend)
    # prepend_id is only defined at this point if prepend was not None.
    if isinstance(text, str):
        ids = self.enc.encode_ordinary(text)
        if prepend is not None:
            # Safe only because this guard mirrors the one above.
            ids.insert(0, prepend_id)
The problem:
- When prepend is a string, encode_single_token(prepend) is called.
- encode_single_token() raises KeyError if the string is not exactly one token.
- The exception propagates up, making it impossible to prepend multi-token sequences.
- Even for single tokens, the code structure is fragile: the inner guard if prepend is not None: checks the original prepend variable, not whether prepend_id was successfully computed.
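The fragility described in the last bullet can be illustrated with a minimal sketch. The function below is hypothetical (it is not code from prepare.py) and uses plain stand-ins for the tokenizer calls; it only shows what Python does when the inner guard stops mirroring the outer one.

```python
def encode_with_diverged_guard(text, prepend=None):
    """Hypothetical variant of encode() whose inner guard no longer mirrors the outer one."""
    if prepend is not None:
        prepend_id = int(prepend)  # stand-in for the real token lookup
    ids = [ord(ch) for ch in text]  # stand-in for encode_ordinary()
    # The inner guard was (hypothetically) changed, so it no longer implies
    # that prepend_id has been assigned.
    if text:
        ids.insert(0, prepend_id)
    return ids


def demo():
    """Return the successful result and the name of the error raised without prepend."""
    ok = encode_with_diverged_guard("hi", prepend=7)
    try:
        encode_with_diverged_guard("hi")  # prepend is None, prepend_id never assigned
    except UnboundLocalError as exc:
        return ok, type(exc).__name__
    return ok, None


print(demo())
```

UnboundLocalError is a subclass of NameError, so this is the failure mode the report warns about. Initializing prepend_id = None before the first guard removes the dependency between the two conditions.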
Reproduction Steps
- Create a tokenizer with BOS token <|reserved_0|>.
- Call tokenizer.encode("hello", prepend="<|reserved_0|>").
- If <|reserved_0|> does not map to exactly one token in the vocabulary, encode_single_token() raises KeyError and encode() fails.
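The steps above can be sketched without installing tiktoken. StubEnc is a hypothetical stand-in, not the real encoding class: it assumes that the single-token lookup raises KeyError for unknown strings (as tiktoken's docstring states) and uses byte values in place of BPE ids. The Tokenizer.encode() body follows the excerpt from prepare.py for string input only.

```python
class StubEnc:
    """Hypothetical stand-in for the tiktoken encoding used by Tokenizer."""

    def __init__(self, special_tokens):
        self.special_tokens = dict(special_tokens)

    def encode_single_token(self, text):
        # Assumption: unknown strings raise KeyError, as tiktoken documents.
        # Simplification: only registered special tokens count as single tokens.
        return self.special_tokens[text]

    def encode_ordinary(self, text):
        # Byte values as a placeholder for real BPE token ids.
        return list(text.encode("utf-8"))


class Tokenizer:
    def __init__(self, enc):
        self.enc = enc

    def encode(self, text, prepend=None):
        # Same control flow as the reported code path for string input.
        if prepend is not None:
            prepend_id = prepend if isinstance(prepend, int) else self.enc.encode_single_token(prepend)
        if isinstance(text, str):
            ids = self.enc.encode_ordinary(text)
            if prepend is not None:
                ids.insert(0, prepend_id)
        return ids


registered = Tokenizer(StubEnc({"<|reserved_0|>": 256}))
unregistered = Tokenizer(StubEnc({}))

print(registered.encode("hello", prepend="<|reserved_0|>"))  # [256, 104, 101, 108, 108, 111]
try:
    unregistered.encode("hello", prepend="<|reserved_0|>")
except KeyError as exc:
    print("prepend lookup failed:", exc)
```

With the token registered the call succeeds; without it, the lookup error escapes encode() before any text is encoded.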
Proposed Fix
def encode(self, text, prepend=None, num_threads=8):
    # Always bind prepend_id so later guards do not depend on the branch above.
    prepend_id = None
    if prepend is not None:
        if isinstance(prepend, int):
            prepend_id = prepend
        else:
            try:
                prepend_id = self.enc.encode_single_token(prepend)
            except (KeyError, ValueError):
                # tiktoken documents KeyError for strings that are not a single token.
                # Fall back: encode as ordinary text and take first token
                fallback = self.enc.encode_ordinary(prepend)
                prepend_id = fallback[0] if fallback else None
    if isinstance(text, str):
        ids = self.enc.encode_ordinary(text)
        if prepend_id is not None:
            ids.insert(0, prepend_id)
    elif isinstance(text, list):
        ids = self.enc.encode_ordinary_batch(text, num_threads=num_threads)
        if prepend_id is not None:
            for row in ids:
                row.insert(0, prepend_id)
    else:
        raise ValueError(f"Invalid input type: {type(text)}")
    return ids
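To make the fallback behavior easier to inspect, the lookup step of the fix can be pulled into a small function and run against a stub. Both resolve_prepend_id and StubEnc are hypothetical names introduced for this sketch; the stub assumes KeyError for unknown tokens and uses byte values as token ids.

```python
class StubEnc:
    """Hypothetical stand-in for the tiktoken encoding."""

    def __init__(self, special_tokens):
        self.special_tokens = dict(special_tokens)

    def encode_single_token(self, text):
        # Assumption: raises KeyError when the string is not a known single token.
        return self.special_tokens[text]

    def encode_ordinary(self, text):
        # Byte values as a placeholder for real BPE token ids.
        return list(text.encode("utf-8"))


def resolve_prepend_id(enc, prepend):
    """Hypothetical helper isolating the prepend lookup from the proposed fix."""
    if prepend is None or isinstance(prepend, int):
        return prepend
    try:
        return enc.encode_single_token(prepend)
    except (KeyError, ValueError):
        # Same fallback as the proposed fix: keep only the first ordinary token.
        fallback = enc.encode_ordinary(prepend)
        return fallback[0] if fallback else None


enc = StubEnc({"<|reserved_0|>": 256})
print(resolve_prepend_id(enc, 5))                 # 5
print(resolve_prepend_id(enc, "<|reserved_0|>"))  # 256
print(resolve_prepend_id(enc, "<|missing|>"))     # 60, the byte value of "<"
print(resolve_prepend_id(enc, ""))                # None
```

One consequence worth weighing: with this fallback an unregistered BOS string is silently reduced to its first ordinary token rather than rejected. If that is not the intended behavior, re-raising with a clearer message may be preferable to falling back.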
Environment
- Python: 3.10+
- autoresearch: latest
- tiktoken: 0.11.0+