RidwanAdebosin
Contributor

Summary

Fixes the Unigram tokenizer's token_to_id and id_to_token functions, which previously contained TODO placeholders that made Unigram models unusable for decoding and vocabulary inspection.

Problem

  • token_to_id always returned Some 0 for every token
  • id_to_token always returned None for every ID
  • Together, these stubs made Unigram models completely unusable for decoding or vocabulary inspection
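For illustration, the pre-fix stubs behaved like this (a paraphrase; the exact placeholder bodies in models.ml may have differed):

```ocaml
(* Paraphrase of the pre-fix TODO stubs: both ignore their input,
   so every token maps to id 0 and no id ever maps back to a token. *)
let token_to_id _model _token = Some 0
let id_to_token _model _id = None
```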

Solution

  • Implemented a proper bidirectional token↔ID mapping
  • Added a hashtable (token_map) so token lookups are O(1)
  • Precomputed the token→ID mapping at model creation using List.iteri
  • Both lookups now return None for unknown tokens and out-of-bounds IDs
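A minimal sketch of this design (the record fields and the constructor name `make_unigram` are simplified assumptions for illustration, not saga's actual signatures):

```ocaml
(* Sketch of the bidirectional mapping. [vocab] is the ordered
   (token, log-probability) list; [token_map] gives O(1) token -> id. *)
type unigram_model = {
  vocab : (string * float) array;
  token_map : (string, int) Hashtbl.t;
}

let make_unigram vocab_list =
  let vocab = Array.of_list vocab_list in
  let token_map = Hashtbl.create (max 1 (Array.length vocab)) in
  (* Precompute token -> id once at model creation; on duplicate
     tokens, the first occurrence wins. *)
  List.iteri
    (fun i (tok, _) ->
      if not (Hashtbl.mem token_map tok) then Hashtbl.add token_map tok i)
    vocab_list;
  { vocab; token_map }

(* O(1) hashtable lookup; None for out-of-vocab tokens. *)
let token_to_id model tok = Hashtbl.find_opt model.token_map tok

(* Array indexing with bounds checking; None for invalid ids. *)
let id_to_token model id =
  if id >= 0 && id < Array.length model.vocab then Some (fst model.vocab.(id))
  else None
```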

Changes Made

Modified Files:

  1. saga/lib/tokenizers/models.ml

    • Updated unigram_model type to include token_map hashtable
    • Modified the unigram constructor to precompute the token→ID hashtable
    • Implemented token_to_id using hashtable lookup (O(1) instead of O(n))
    • Implemented id_to_token with proper bounds checking
  2. saga/lib/tokenizers/models.mli

    • Updated interface to include token_map field
  3. saga/lib/tokenizers/trainers.ml

    • Updated train_unigram to populate token_map when building the model
  4. saga/test/test_tokenization.ml

    • Added comprehensive Unigram-specific tests

Tests Added

  • ✅ Basic token↔ID lookups
  • ✅ Out-of-vocab queries (returns None)
  • ✅ Out-of-bounds ID queries (returns None)
  • ✅ Round-trip conversions (token→ID→token)
  • ✅ Empty vocabulary edge case
  • ✅ Large vocabulary (10,000 tokens)
  • ✅ Duplicate tokens handling
  • ✅ Special characters & Unicode support
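The round-trip and out-of-vocab checks above can be sketched in isolation; this mirrors the token_map design with plain stdlib structures rather than saga's actual test harness:

```ocaml
(* Self-contained round-trip sketch: build the forward map once,
   then check token -> id -> token for every vocabulary entry. *)
let vocab = [| "hello"; "world"; "!" |]

let token_map : (string, int) Hashtbl.t = Hashtbl.create 3
let () = Array.iteri (fun i tok -> Hashtbl.add token_map tok i) vocab

let token_to_id tok = Hashtbl.find_opt token_map tok

let id_to_token id =
  if id >= 0 && id < Array.length vocab then Some vocab.(id) else None

(* Round trip: every known token survives token -> id -> token. *)
let () =
  Array.iter
    (fun tok ->
      match token_to_id tok with
      | Some id -> assert (id_to_token id = Some tok)
      | None -> assert false)
    vocab
```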
