Chat template: update for processor #35953

Open
wants to merge 8 commits into main

Conversation

@zucchini-nlp (Member) commented Jan 29, 2025

What does this PR do?

Prerequisite: we need support for images passed as batched lists in all VLMs, which is now done thanks to Yoni 💛

This PR adds:

  • Support for batched inputs, which I forgot to add last time
  • A tiny hack for add_special_tokens, as per @Rocketknight1's request
  • More tests for chat templates in the Mixin, so we don't rewrite the same test for every model. Note that for these tests to run, the processor needs its real chat_template, not a dummy one
  • Video loading that can use video_fps, which is claimed to work better than sampling N frames uniformly, especially for longer videos. NOTE: this results in videos with different numbers of frames and can cause errors when batching, so it should be used only when bs=1. We don't support batching videos with different frame counts yet. See the usage sketch after this list
  • Ability to load a video from a list of frames saved as image files
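
For illustration, a minimal sketch of the one-step usage this enables. The checkpoint, video path, and message schema are placeholders/assumptions; the processor kwargs mirror the ones exercised in the tests added in this PR:

from transformers import AutoProcessor

# Hypothetical video-capable VLM checkpoint; any processor with a real chat_template works the same way.
processor = AutoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "my_video.mp4"},  # placeholder local path
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Sample frames at a fixed rate instead of a fixed count.
# `video_fps` and `num_frames` are mutually exclusive, so pass only one of them.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    video_fps=1,
)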

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qubvel (Member) left a comment


Thanks! Looks good to me. I was not very thorough with the chat-template code, so I'm leaving that part to @Rocketknight1.

Comment on lines +541 to +544
    Number of frames to sample uniformly. Should be passed only when `fps=None`.
    If not specified and `fps==None`, all frames are sampled.
fps (`int`, *optional*):
    Number of frames to sample per second. Should be passed only when `num_frames=None`.
Member

Is it worth adding a check that just one of these args is provided?

@zucchini-nlp (Member, Author) Jan 31, 2025

I think neither of them has special priority, and users should choose only one sampling method. Otherwise we assume it was user error, as we can't infer their actual intention.

Right, that was a question hehe. Agreed and added, yes. I'll look again at where exactly it fits best :)

Comment on lines +724 to +726
if fps is not None and num_frames is not None:
    raise ValueError("`num_frames` and `fps` are mutually exclusive arguments, please use only one!")

Member

ahh, I see you are doing the check here... probably better to delegate it to the function itself, but it's up to you
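
For illustration, delegating the check could look roughly like this; the function name and signature are assumptions for the sketch, not the actual helper touched in this PR:

def load_video(video, num_frames=None, fps=None):
    # Hypothetical loader-level validation: reject ambiguous sampling requests early,
    # so every caller gets the same error instead of each processor re-implementing it.
    if fps is not None and num_frames is not None:
        raise ValueError("`num_frames` and `fps` are mutually exclusive arguments, please use only one!")
    ...  # frame sampling and decoding would follow here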

Comment on lines +821 to +830
# Load with `video_fps` and `num_frames` args, should raise an error
with self.assertRaises(ValueError):
    out_dict_with_video = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        video_fps=video_fps,
        num_frames=num_frames,
    )
Member

Thanks for adding a test for the error raised!

src/transformers/models/aria/modular_aria.py (conversation resolved)
Comment on lines -832 to -833
if chat_template is not None:
    setattr(processor, "chat_template", chat_template)
Member Author

btw, this was moved to get_processor_dict, as it left room for hacky manipulations. Example: mllama forgot to pass chat_template to MLlamaProcessor.__init__, but still magically ended up with a valid template. It took me some time to figure out, so let's not leave options for users to hack around.

if videos:
    batch_videos.append(videos)

# Tokenizer's `apply_chat_template` never adds special tokens when tokenizing
Member

It always adds them, no?

Member Author

oops, misleading comment, it always adds them, yes

Member

Suggested change
# Tokenizer's `apply_chat_template` never adds special tokens when tokenizing
# Tokenizer's `apply_chat_template` adds special tokens when tokenizing


# Tokenizer's `apply_chat_template` never adds special tokens when tokenizing
# But processor's `apply_chat_template` didn't have an option to tokenize, so users had to format the prompt
# and pass it to the processor. Users thus never worried about special tokens relying on processor hadnling
Member

Suggested change
# and pass it to the processor. Users thus never worried about special tokens relying on processor hadnling
# and pass it to the processor. Users thus never worried about special tokens relying on processor handling

Actually, users had to be careful to use add_special_tokens=False when tokenizing the rendered template. Is this not the case for some VLMs?

@zucchini-nlp (Member, Author) Feb 6, 2025

They had to, but we never documented it. For example, in llava we didn't add the BOS in the template, while idefics copied llava and did add it, which caused the problem of duplicated special tokens.

Probably we need to surface this quirk in the model docs, e.g. as a small comment in the demo inference code.
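
For context, a rough sketch of the two-step pattern being discussed. The checkpoint and image are placeholders, and whether add_special_tokens=False is actually needed depends on whether the model's chat template already renders BOS (per the discussion above, idefics does and llava does not):

from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")  # placeholder BOS-in-template VLM
image = Image.open("my_image.png")  # placeholder image

messages = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What is shown here?"}]}
]

# Step 1: render the chat template to a plain string (no tokenization).
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Step 2: tokenize with the processor. If the template already rendered BOS, the default
# add_special_tokens=True would prepend a second one, hence the explicit flag here.
inputs = processor(text=prompt, images=image, add_special_tokens=False, return_tensors="pt")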

Member

I think the standard is for chat templates to handle special tokens, so idefics is doing it "right" and llava is an exception (that we have to provide BC for, of course). Perhaps we could default to the expected behaviour in our reference code, and treat exceptional cases based on some other property?

@zucchini-nlp (Member, Author) Feb 6, 2025

yeah, I think we'll need to:

  1. Update the demo inference code for tokenize=False, and let users know why we pass add_special_tokens=False
  2. Slowly check whether updating the templates in llava draws complaints from users, e.g. after >5 minor releases. It might still be a bad idea until we make tokenize=True the go-to default for everyone, since most users are stuck on older versions of transformers
  3. For any new model, make sure the BOS/EOS are in the template and not hacked in from within the processing code!

Comment on lines +1282 to +1283
if self.tokenizer.bos_token is not None and single_prompt.startswith(self.tokenizer.bos_token):
    kwargs["add_special_tokens"] = False
Member

❤️! Yes, I think this is the right solution - it's a little hacky, but it should cover all the cases I can think of correctly.
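
For illustration, a small hedged demo of the duplicated-BOS issue this check avoids; the tokenizer checkpoint is just an example of a BOS-adding tokenizer, not something touched by this PR:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")  # example tokenizer with add_bos_token=True

rendered = tokenizer.bos_token + "USER: hi ASSISTANT:"  # pretend the chat template already inserted BOS

duplicated = tokenizer(rendered).input_ids                              # default tokenization prepends a second BOS
deduplicated = tokenizer(rendered, add_special_tokens=False).input_ids  # what the check above opts into

print(duplicated[:2])    # two BOS ids in a row
print(deduplicated[:1])  # a single BOS id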

@zucchini-nlp (Member, Author)

Cool, I will merge it then. Core modeling is not touched, and a code-owners review should be enough.

@pcuenca (Member) commented Feb 7, 2025

Looking forward to this landing in a release so we can show one-step processor snippets for chat VLMs! 🔥
