process input as str by adding it to a list #43

Open
fleurvanl wants to merge 2 commits into main from 23-tokenizer-breaks-on-single-string-input

Conversation

@fleurvanl
Contributor

No description provided.

@fleurvanl fleurvanl linked an issue May 1, 2026 that may be closed by this pull request

def tokenize(self, texts, split_special_tokens=False, **kwargs):
    encoded_inputs = []
    texts = [texts] if isinstance(texts, str) else texts
Member

This works for me, but looking at transformers I see two distinct ways of doing this: the one above, and checking whether the first element of the sequence is itself a sequence.

I have no idea why they sometimes use one and sometimes the other.

https://github.com/huggingface/transformers/blob/7f6419e67de355ee173344c1bfd68cb60288e121/src/transformers/tokenization_python.py#L719
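
For reference, the two checks mentioned above can be sketched side by side. This is a minimal illustration, not code from this PR or from transformers: the function names `wrap_if_str` and `wrap_if_flat` are hypothetical, and the second check is only an assumption about what "the first element is a sequence" might look like in practice.

```python
def wrap_if_str(texts):
    """Approach 1 (as in this PR): a bare str becomes a one-element batch."""
    return [texts] if isinstance(texts, str) else texts


def wrap_if_flat(texts):
    """Approach 2 (hypothetical sketch): the input counts as a batch only if
    it is a non-empty list/tuple whose first element is itself a list/tuple."""
    is_batched = (
        isinstance(texts, (list, tuple))
        and len(texts) > 0
        and isinstance(texts[0], (list, tuple))
    )
    return texts if is_batched else [texts]


# Approach 1: a single string is wrapped, a list of strings is left alone.
single = wrap_if_str("hello world")
batch = wrap_if_str(["hello world", "foo bar"])

# Approach 2: a flat list of words is treated as ONE pre-split input,
# while a list of word lists is treated as a batch and left alone.
one_pre_split = wrap_if_flat(["hello", "world"])
batch_pre_split = wrap_if_flat([["hello", "world"], ["foo"]])
```

The difference only shows up when a flat list of strings can mean either "several texts" or "one text already split into words". If `tokenize` only ever receives raw strings, the `isinstance(texts, str)` check is sufficient. Whether that distinction is the actual reason transformers uses both forms is a guess, not something the linked line confirms.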

Development

Successfully merging this pull request may close these issues.

Tokenizer breaks on single-string input

3 participants