Skip to content

Generalize tokenizer, enable future parametrization of architectures#38

Open
akaIDIOT wants to merge 6 commits intomainfrom
feature/generalize-tokenizer
Open

Generalize tokenizer, enable future parametrization of architectures#38
akaIDIOT wants to merge 6 commits intomainfrom
feature/generalize-tokenizer

Conversation

@akaIDIOT
Copy link
Copy Markdown
Member

No description provided.

Comment on lines +73 to +77
@pytest.mark.skipif(os.environ.get('CI') == 'true', reason="don't run this test on CI")
def test_embed_unknown_architecture(anchor, model):
with pytest.raises(KeyError, match='mips'):
# FIXME: this fails, kwargs are not passed along all the way to the tokenizer :(
model.encode(anchor, architecture='mips')
Copy link
Copy Markdown
Member Author

@akaIDIOT akaIDIOT Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this test fails, sentence_transformers.base.modules.transformer.Transformer.preprocess() does not pass it along... 🤔

(call stack visible by setting a breakpoint inside ASMTokenizer.tokenize())

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does ST contribute to this? Would this work differently in the refactor branch?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. The actual not passing along of that kwarg is ... not done by ST, so it might. I'll take a gander at the other branch, see if that alleviates the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants