[Bug] Tokenizer.encode: fragile prepend_id scoping, and string prepend fails when it is not a single token

## Bug

In `Tokenizer.encode()`, `prepend_id` is assigned inside the outer `if prepend is not None:` block but read inside the nested `if isinstance(text, str):` block, under a second `if prepend is not None:` guard. Today both guards test the same condition, so the name is always bound when it is read, but the dependency is fragile: if either guard changes, `prepend_id` can be referenced before assignment. More critically, when `prepend` is a string (e.g. the BOS token string), `encode_single_token()` is called, and it raises (a `KeyError` in tiktoken) if the string is not a valid single token, so `encode()` fails before anything is prepended.

## Version

- autoresearch: latest (main branch)
- Python: 3.10+
- Dependencies: tiktoken>=0.11.0

## Root Cause Analysis

**Location:** `prepare.py` — `Tokenizer.encode()` method

```python
def encode(self, text, prepend=None, num_threads=8):
    if prepend is not None:
        # Raises for a string prepend that is not exactly one token
        prepend_id = prepend if isinstance(prepend, int) else self.enc.encode_single_token(prepend)
    # prepend_id is bound only when prepend is not None
    if isinstance(text, str):
        ids = self.enc.encode_ordinary(text)
        if prepend is not None:
            ids.insert(0, prepend_id)  # safe only while this guard matches the one above
```

The problem:
1. When `prepend` is a string, `encode_single_token(prepend)` is called.
2. `encode_single_token()` raises if the string is not exactly one token in the vocabulary (tiktoken raises `KeyError`).
3. The exception propagates out of `encode()`, so a multi-token string cannot be used as `prepend`. The `ids.insert(0, prepend_id)` line is never reached in that case, so the failure surfaces as that exception rather than as a `NameError`.
4. Even for single tokens, the code structure is fragile: the inner guard `if prepend is not None:` tests the original `prepend` argument, not whether `prepend_id` was actually bound.
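The scoping concern in point 4 can be illustrated without tiktoken. The sketch below is not code from `prepare.py`: `build`, `build_safe`, and the `skip_empty` flag are hypothetical names, invented to show what happens if a later refactor changes one guard but not the other. Reading the unbound local raises `UnboundLocalError`, which is a subclass of `NameError`.

```python
def build(text, prepend=None, skip_empty=False):
    # Hypothetical refactor: the binding guard gained an extra condition,
    # but the guard at the point of use was left unchanged.
    if prepend is not None and not skip_empty:
        prepend_id = int(prepend)
    ids = [ord(c) for c in text]
    if prepend is not None:
        ids.insert(0, prepend_id)  # unbound when skip_empty=True
    return ids


def build_safe(text, prepend=None, skip_empty=False):
    # Binding the name up front and guarding on the name itself keeps
    # the two blocks consistent regardless of later edits.
    prepend_id = None
    if prepend is not None and not skip_empty:
        prepend_id = int(prepend)
    ids = [ord(c) for c in text]
    if prepend_id is not None:
        ids.insert(0, prepend_id)
    return ids
```

With matching guards `build("a", prepend=7)` works, while `build("a", prepend=7, skip_empty=True)` fails at the `insert` call; `build_safe` returns the text ids in both cases.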

## Reproduction Steps

1. Create a tokenizer with BOS token `<|reserved_0|>`
2. Call `tokenizer.encode("hello", prepend="<|reserved_0|>")`
3. If `<|reserved_0|>` is not a single token in the vocabulary (for example, because it was not registered as a special token), `encode_single_token()` raises `KeyError` and `encode()` fails
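The steps above can be sketched offline with a stand-in encoder. `StubEncoding` and `encode_current` are hypothetical names introduced for this example. The stub uses a toy one-token-per-character scheme rather than real BPE, and it assumes that a failed single-token lookup raises `KeyError`, as tiktoken does. `encode_current` mirrors the string branch of the current `encode()` shown above.

```python
class StubEncoding:
    """Hypothetical stand-in for the tiktoken encoding held in self.enc."""

    def __init__(self, vocab):
        self.vocab = vocab  # maps token string -> id

    def encode_single_token(self, text):
        # Assumption: like tiktoken, fail with KeyError for unknown tokens.
        return self.vocab[text]

    def encode_ordinary(self, text):
        # Toy scheme: one token per character, not real BPE.
        return [self.vocab[ch] for ch in text]


def encode_current(enc, text, prepend=None):
    """Mirrors the string branch of the current encode()."""
    if prepend is not None:
        prepend_id = prepend if isinstance(prepend, int) else enc.encode_single_token(prepend)
    ids = enc.encode_ordinary(text)
    if prepend is not None:
        ids.insert(0, prepend_id)
    return ids


enc = StubEncoding({"h": 0, "i": 1, "<|bos|>": 2})

# A registered single token is prepended as expected.
ok = encode_current(enc, "hi", prepend="<|bos|>")

# An unregistered string fails inside encode_single_token.
try:
    encode_current(enc, "hi", prepend="<|reserved_0|>")
    error_name = None
except Exception as exc:
    error_name = type(exc).__name__
```

In this sketch the failure comes from the lookup itself, before `prepend_id` is ever read.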

## Proposed Fix

```python
def encode(self, text, prepend=None, num_threads=8):
    prepend_id = None
    if prepend is not None:
        if isinstance(prepend, int):
            prepend_id = prepend
        else:
            try:
                prepend_id = self.enc.encode_single_token(prepend)
            except (KeyError, ValueError):  # tiktoken raises KeyError for unknown tokens
                # Fall back: encode as ordinary text and take first token
                fallback = self.enc.encode_ordinary(prepend)
                prepend_id = fallback[0] if fallback else None

    if isinstance(text, str):
        ids = self.enc.encode_ordinary(text)
        if prepend_id is not None:
            ids.insert(0, prepend_id)
    elif isinstance(text, list):
        ids = self.enc.encode_ordinary_batch(text, num_threads=num_threads)
        if prepend_id is not None:
            for row in ids:
                row.insert(0, prepend_id)
    else:
        raise ValueError(f"Invalid input type: {type(text)}")
    return ids
```
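A quick way to sanity-check the proposed logic without a real vocabulary is to run it against a stub. This is a sketch under stated assumptions: `StubEncoding` and `encode_fixed` are hypothetical names, the stub maps each character to its code point instead of doing BPE, and the `except` clause catches `KeyError` as well as `ValueError` on the assumption that `KeyError` is what tiktoken raises for a string that is not a single token.

```python
class StubEncoding:
    """Hypothetical stand-in for self.enc; one token per character."""

    def __init__(self, specials):
        self.specials = specials  # special token string -> id

    def encode_single_token(self, text):
        if text in self.specials:
            return self.specials[text]
        if len(text) == 1:
            return ord(text)
        raise KeyError(text)  # assumption: mirrors tiktoken's failure mode

    def encode_ordinary(self, text):
        return [ord(ch) for ch in text]

    def encode_ordinary_batch(self, texts, num_threads=8):
        return [self.encode_ordinary(t) for t in texts]


def encode_fixed(enc, text, prepend=None, num_threads=8):
    """Free-function version of the proposed encode(), for testing."""
    prepend_id = None
    if prepend is not None:
        if isinstance(prepend, int):
            prepend_id = prepend
        else:
            try:
                prepend_id = enc.encode_single_token(prepend)
            except (KeyError, ValueError):
                # Fall back: encode as ordinary text and take first token
                fallback = enc.encode_ordinary(prepend)
                prepend_id = fallback[0] if fallback else None
    if isinstance(text, str):
        ids = enc.encode_ordinary(text)
        if prepend_id is not None:
            ids.insert(0, prepend_id)
    elif isinstance(text, list):
        ids = enc.encode_ordinary_batch(text, num_threads=num_threads)
        if prepend_id is not None:
            for row in ids:
                row.insert(0, prepend_id)
    else:
        raise ValueError(f"Invalid input type: {type(text)}")
    return ids


enc = StubEncoding({"<|bos|>": 1000})
single = encode_fixed(enc, "hi", prepend="<|bos|>")
batch = encode_fixed(enc, ["a", "b"], prepend="<|bos|>")
multi = encode_fixed(enc, "hi", prepend="xy")  # multi-token prepend
```

Note that the fallback silently keeps only the first token of a multi-token `prepend` (here only `x` from `"xy"`). Whether that is preferable to raising a descriptive error is a design decision for the maintainers.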

## Environment

- Python: 3.10+
- autoresearch: latest
- tiktoken: 0.11.0+

