Bug: BPB metric inflated by UTF-8 replacement characters in token byte count #384

@warren618

Description

In prepare.py (lines 188-193), the token byte count for BPB evaluation is computed by decoding each token via tiktoken and re-encoding to UTF-8:

token_str = enc.decode([token_id])
token_bytes_list.append(len(token_str.encode("utf-8")))

For BPE tokens whose raw bytes are not valid standalone UTF-8 (e.g. a single continuation byte 0x80, which is 1 raw byte), enc.decode() replaces the invalid sequence with U+FFFD (the Unicode replacement character), which encodes to 3 UTF-8 bytes. Such a token is counted as 3 bytes instead of 1, which inflates the byte-count denominator in BPB and produces artificially lower (better-looking) scores.
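The inflation can be reproduced without tiktoken. The sketch below assumes only that the decode step behaves like Python's `bytes.decode("utf-8", errors="replace")`; the helper name `round_trip_len` is hypothetical and not part of prepare.py.

```python
def round_trip_len(raw: bytes) -> int:
    """Byte length after a lossy decode followed by a UTF-8 re-encode."""
    text = raw.decode("utf-8", errors="replace")  # invalid bytes become U+FFFD
    return len(text.encode("utf-8"))

# Each entry: description -> raw token bytes (illustrative values only).
samples = {
    "plain ASCII": b"hello",
    "lone continuation byte": b"\x80",
    "byte never valid in UTF-8": b"\xff",
}

for name, raw in samples.items():
    print(f"{name}: raw={len(raw)} round_trip={round_trip_len(raw)}")
```

Valid UTF-8 survives the round trip unchanged, while each of the invalid single-byte tokens comes back three bytes long.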

Impact

The BPB metric, the core evaluation metric that drives the entire autoresearch loop, is silently reported lower than its true value. Experiments may appear to perform better than they actually do, and comparisons between runs with different tokenizers (which may have different ratios of non-UTF-8 tokens) are not apples-to-apples.
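A toy calculation illustrates the size of the effect. This is a sketch that assumes BPB is computed as total loss in nats divided by (ln 2 × total bytes); the function name, byte tables, token ids, and loss value are all made up for illustration and are not taken from prepare.py.

```python
import math

def bits_per_byte(total_nats, token_ids, token_bytes):
    """Assumed BPB definition: summed loss in nats over ln(2) * summed bytes."""
    total_bytes = sum(token_bytes[t] for t in token_ids)
    return total_nats / (math.log(2) * total_bytes)

# Hypothetical per-token byte tables; token 1 is a lone invalid byte.
true_bytes = {0: 5, 1: 1, 2: 3}
inflated_bytes = {0: 5, 1: 3, 2: 3}  # token 1 counted as U+FFFD (3 bytes)

token_ids = [0, 1, 1, 2]  # hypothetical evaluation sequence
total_nats = 12.0         # hypothetical summed cross-entropy loss

bpb_true = bits_per_byte(total_nats, token_ids, true_bytes)          # 10 bytes
bpb_inflated = bits_per_byte(total_nats, token_ids, inflated_bytes)  # 14 bytes

print(f"true BPB:     {bpb_true:.4f}")
print(f"inflated BPB: {bpb_inflated:.4f}")
```

With the same loss, the inflated table divides by 14 bytes instead of 10, so the reported BPB drops to 10/14 of its true value in this toy case. The real-world effect depends on how often such tokens occur in the evaluation data.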

Fix

Use mergeable_ranks directly to get the true raw byte length of each token:

rank_to_bytes = {rank: raw for raw, rank in mergeable_ranks.items()}
for token_id in range(enc.n_vocab):
    if token_id in rank_to_bytes:
        token_bytes_list.append(len(rank_to_bytes[token_id]))
    else:
        token_bytes_list.append(0)  # special tokens
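As a sanity check, the two byte tables can be compared to list which token ids are affected. The sketch below uses a small hand-written dict as a stand-in for `mergeable_ranks` and a hypothetical `n_vocab` of 4 (three mergeable tokens plus one special token); the real values would come from the tokenizer in prepare.py.

```python
# Toy stand-in for mergeable_ranks: raw token bytes -> rank (token id).
mergeable_ranks = {
    b"a": 0,
    b"\x80": 1,          # lone continuation byte, invalid on its own
    b"\xe2\x82\xac": 2,  # valid 3-byte UTF-8 sequence (the euro sign)
}
n_vocab = 4  # assumption: id 3 is a special token with no raw bytes

rank_to_bytes = {rank: raw for raw, rank in mergeable_ranks.items()}

raw_lens = []         # proposed method: true raw byte length
round_trip_lens = []  # current method: decode with replacement, re-encode
for token_id in range(n_vocab):
    raw = rank_to_bytes.get(token_id, b"")  # special tokens count as 0 bytes
    raw_lens.append(len(raw))
    text = raw.decode("utf-8", errors="replace")
    round_trip_lens.append(len(text.encode("utf-8")))

# Token ids whose byte count differs between the two methods.
affected = [i for i in range(n_vocab) if raw_lens[i] != round_trip_lens[i]]
print("raw lengths:       ", raw_lens)
print("round-trip lengths:", round_trip_lens)
print("affected token ids:", affected)
```

Running a comparison like this over the real vocabulary would show how many tokens are miscounted before and after the fix.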

I will submit a fix shortly.
