RidwanAdebosin
Contributor

Summary

Fixes the Unigram tokenizer's token_to_id and id_to_token functions, which previously contained TODO placeholders that made Unigram models unusable for decoding and vocabulary inspection.

Problem

  • token_to_id always returned Some 0 for every token
  • id_to_token always returned None for every ID
  • Together, these stubs made Unigram models completely unusable for decoding or vocabulary inspection
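For illustration, the pre-fix stubs behaved like this (a paraphrase; the exact placeholder bodies in models.ml may have differed):

```ocaml
(* Paraphrase of the pre-fix TODO stubs: both ignore their input,
   so every token maps to id 0 and no id ever maps back to a token. *)
let token_to_id _model _token = Some 0
let id_to_token _model _id = None
```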

Solution

  • Implemented a proper bidirectional token↔ID mapping
  • Added a hashtable (token_map) so token lookups are O(1)
  • Precomputed the token→ID mapping at model creation using List.iteri
  • Both lookups now return None for unknown tokens and out-of-bounds IDs
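A minimal sketch of this design (the record fields and the constructor name `make_unigram` are simplified assumptions for illustration, not saga's actual signatures):

```ocaml
(* Sketch of the bidirectional mapping. [vocab] is the ordered
   (token, log-probability) list; [token_map] gives O(1) token -> id. *)
type unigram_model = {
  vocab : (string * float) array;
  token_map : (string, int) Hashtbl.t;
}

let make_unigram vocab_list =
  let vocab = Array.of_list vocab_list in
  let token_map = Hashtbl.create (max 1 (Array.length vocab)) in
  (* Precompute token -> id once at model creation; on duplicate
     tokens, the first occurrence wins. *)
  List.iteri
    (fun i (tok, _) ->
      if not (Hashtbl.mem token_map tok) then Hashtbl.add token_map tok i)
    vocab_list;
  { vocab; token_map }

(* O(1) hashtable lookup; None for out-of-vocab tokens. *)
let token_to_id model tok = Hashtbl.find_opt model.token_map tok

(* Array indexing with bounds checking; None for invalid ids. *)
let id_to_token model id =
  if id >= 0 && id < Array.length model.vocab then Some (fst model.vocab.(id))
  else None
```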

Changes Made

Modified Files:

  1. saga/lib/tokenizers/models.ml

    • Updated unigram_model type to include token_map hashtable
    • Modified the unigram constructor to precompute the token→ID hashtable
    • Implemented token_to_id using hashtable lookup (O(1) instead of O(n))
    • Implemented id_to_token with proper bounds checking
  2. saga/lib/tokenizers/models.mli

    • Updated interface to include token_map field
  3. saga/lib/tokenizers/trainers.ml

    • Updated train_unigram to populate token_map when building the model
  4. saga/test/test_tokenization.ml

    • Added comprehensive Unigram-specific tests

Tests Added

  • ✅ Basic token↔ID lookups
  • ✅ Out-of-vocab queries (returns None)
  • ✅ Out-of-bounds ID queries (returns None)
  • ✅ Round-trip conversions (token→ID→token)
  • ✅ Empty vocabulary edge case
  • ✅ Large vocabulary (10,000 tokens)
  • ✅ Duplicate tokens handling
  • ✅ Special characters & Unicode support
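The round-trip and out-of-vocab checks above can be sketched in isolation; this mirrors the token_map design with plain stdlib structures rather than saga's actual test harness:

```ocaml
(* Self-contained round-trip sketch: build the forward map once,
   then check token -> id -> token for every vocabulary entry. *)
let vocab = [| "hello"; "world"; "!" |]

let token_map : (string, int) Hashtbl.t = Hashtbl.create 3
let () = Array.iteri (fun i tok -> Hashtbl.add token_map tok i) vocab

let token_to_id tok = Hashtbl.find_opt token_map tok

let id_to_token id =
  if id >= 0 && id < Array.length vocab then Some vocab.(id) else None

(* Round trip: every known token survives token -> id -> token. *)
let () =
  Array.iter
    (fun tok ->
      match token_to_id tok with
      | Some id -> assert (id_to_token id = Some tok)
      | None -> assert false)
    vocab
```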
