Bug: BPB metric inflated by UTF-8 replacement characters in token byte count #384

### Description

In `prepare.py` (lines 188-193), the token byte count for BPB evaluation is computed by decoding each token via tiktoken and re-encoding to UTF-8:

```python
token_str = enc.decode([token_id])
token_bytes_list.append(len(token_str.encode("utf-8")))
```

For BPE tokens whose raw bytes are not valid standalone UTF-8 (e.g. a single continuation byte `0x80`, which is 1 raw byte), `enc.decode()` replaces each invalid sequence with U+FFFD, the Unicode replacement character (tiktoken's `Encoding.decode` uses `errors="replace"` by default). U+FFFD encodes to 3 UTF-8 bytes, so such a token is counted as 3 bytes instead of 1. This inflates the byte-count denominator in BPB, producing **artificially lower (better-looking) scores**.
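The inflation can be reproduced without tiktoken. The sketch below assumes the decode step behaves like Python's built-in `bytes.decode(..., errors="replace")`; the variable names are illustrative and do not come from `prepare.py`.

```python
# A lone UTF-8 continuation byte, standing in for a byte-level BPE token.
raw = b"\x80"

# Decode with replacement (assumed to mirror the decode step), then re-encode.
round_tripped = raw.decode("utf-8", errors="replace").encode("utf-8")

true_len = len(raw)               # raw byte length of the token
counted_len = len(round_tripped)  # length after the lossy round trip

print(true_len, counted_len, round_tripped)  # 1 3 b'\xef\xbf\xbd'
```

The three bytes `EF BF BD` are the UTF-8 encoding of U+FFFD, so the original byte is lost and the count triples.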

### Impact

The BPB metric, the core evaluation metric that drives the entire autoresearch loop, is silently underestimated. Experiments appear to perform better than they actually do, and comparisons between runs with different tokenizers (which may contain different proportions of non-UTF-8 tokens) are not apples-to-apples.
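To make the direction of the error concrete, here is a small worked example. It assumes BPB is computed as the summed cross-entropy in nats divided by `ln(2)` times the total byte count, which may differ in detail from the evaluation code. All numbers are hypothetical.

```python
import math


def bits_per_byte(total_nats: float, total_bytes: int) -> float:
    """Convert a summed cross-entropy in nats to bits per byte."""
    return total_nats / (math.log(2) * total_bytes)


# Hypothetical values, chosen only for illustration.
total_nats = 1000.0
true_bytes = 1500
inflated_bytes = 1520  # e.g. 10 one-byte tokens each counted as 3 bytes

bpb_true = bits_per_byte(total_nats, true_bytes)
bpb_reported = bits_per_byte(total_nats, inflated_bytes)

print(f"true={bpb_true:.4f} reported={bpb_reported:.4f}")
```

The loss is identical in both cases. Only the denominator changes, so the reported value is lower than the true one by the factor `true_bytes / inflated_bytes`.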

### Fix

Use `mergeable_ranks` directly to get the true raw byte length of each token:

```python
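# Invert the BPE table (raw bytes -> rank) so each token id maps to its raw bytes.
rank_to_bytes = {rank: raw for raw, rank in mergeable_ranks.items()}
token_bytes_list = []
for token_id in range(enc.n_vocab):
    raw = rank_to_bytes.get(token_id)
    # Special tokens are absent from mergeable_ranks and count as 0 bytes.
    token_bytes_list.append(len(raw) if raw is not None else 0)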
```
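As a sanity check, the two approaches can be compared on a toy vocabulary. This is a sketch under stated assumptions: `toy_ranks`, `raw_lengths`, and `round_trip_lengths` are hypothetical names, and the old behaviour is emulated with Python's built-in `bytes.decode(..., errors="replace")` rather than with tiktoken itself.

```python
def raw_lengths(mergeable_ranks: dict[bytes, int], n_vocab: int) -> list[int]:
    """Proposed approach: byte length read from the BPE table, 0 for ids not in it."""
    rank_to_bytes = {rank: raw for raw, rank in mergeable_ranks.items()}
    return [len(rank_to_bytes.get(i, b"")) for i in range(n_vocab)]


def round_trip_lengths(mergeable_ranks: dict[bytes, int], n_vocab: int) -> list[int]:
    """Old approach, emulated: decode with replacement, re-encode, measure."""
    rank_to_bytes = {rank: raw for raw, rank in mergeable_ranks.items()}
    return [
        len(rank_to_bytes.get(i, b"").decode("utf-8", errors="replace").encode("utf-8"))
        for i in range(n_vocab)
    ]


# Toy table: ASCII "a", a lone continuation byte, and the full euro sign.
toy_ranks = {b"a": 0, b"\x80": 1, b"\xe2\x82\xac": 2}
n_vocab = 4  # id 3 plays the role of a special token

fixed = raw_lengths(toy_ranks, n_vocab)
legacy = round_trip_lengths(toy_ranks, n_vocab)
inflated_ids = [i for i, (a, b) in enumerate(zip(fixed, legacy)) if a != b]

print(fixed, legacy, inflated_ids)  # [1, 1, 3, 0] [1, 3, 3, 0] [1]
```

Only the token that is not valid standalone UTF-8 differs. Valid tokens and the special-token slot get the same length from both methods.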

I will submit a fix shortly.
