Chat template: update for processor #35953
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks! Looks good to me, I was not very thorough in chat-template code, leaving it for @Rocketknight1
        Number of frames to sample uniformly. Should be passed only when `fps=None`.
        If not specified and `fps==None`, all frames are sampled.
    fps (`int`, *optional*):
        Number of frames to sample per second. Should be passed only when `num_frames=None`.
Is it worth adding a check that just one of the args is provided?
I think neither of them has a special priority, and users should choose only one sampling method to sample frames. Otherwise we assume it was user error, as we can't infer their actual intentions
Right, that was a question hehe, agreed and added yes. I'll look again where exactly it fits the best :)
if fps is not None and num_frames is not None:
    raise ValueError("`num_frames` and `fps` are mutually exclusive arguments, please use only one!")
ahh, I see you are doing a check here.. probably better to delegate it to the function itself, but it's up to you
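For illustration, a rough sketch of what delegating the check into the sampling helper itself might look like (the helper name and signature here are hypothetical, not the PR's actual code):

# Hypothetical sketch: the mutual-exclusivity check lives inside the sampling helper,
# so every caller that samples frames goes through it.
def sample_frame_indices(total_frames: int, source_fps: float, num_frames=None, fps=None):
    if fps is not None and num_frames is not None:
        raise ValueError("`num_frames` and `fps` are mutually exclusive arguments, please use only one!")
    if fps is not None:
        # convert a "frames per second" rate into an absolute frame count for this clip
        num_frames = max(1, int(total_frames / source_fps * fps))
    if num_frames is None:
        # neither arg given: keep every frame
        return list(range(total_frames))
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]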
# Load with `video_fps` and `num_frames` args, should raise an error
with self.assertRaises(ValueError):
    out_dict_with_video = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        video_fps=video_fps,
        num_frames=num_frames,
    )
Thanks for adding a test for the error raised!
if chat_template is not None:
    setattr(processor, "chat_template", chat_template)
btw, this was moved to the `get_processor_dict`, as it created a situation for hacky manipulations. Example: mllama forgot to pass `chat_template` to the `MllamaProcessor.__init__`, but still magically ended up with a valid template. Took me some time to figure it out, let's not create options for users to hack
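To make the pattern concrete, a minimal sketch of the explicit alternative (the class below is illustrative, not actual library code):

# Sketch of the encouraged pattern: accept `chat_template` in `__init__` and forward it,
# instead of attaching it afterwards with `setattr`, so a processor that forgets to
# accept it fails loudly rather than working by accident.
from transformers import ProcessorMixin

class MyVLMProcessor(ProcessorMixin):  # illustrative, not a real transformers class
    attributes = ["image_processor", "tokenizer"]
    image_processor_class = "AutoImageProcessor"
    tokenizer_class = "AutoTokenizer"

    def __init__(self, image_processor=None, tokenizer=None, chat_template=None, **kwargs):
        # forwarding keeps the template part of the processor's own state and serialization
        super().__init__(image_processor, tokenizer, chat_template=chat_template, **kwargs)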
if videos:
    batch_videos.append(videos)

# Tokenizer's `apply_chat_template` never adds special tokens when tokenizing
It always adds them, no?
oops, misleading comment, it always adds them, yes
Suggested change:
- # Tokenizer's `apply_chat_template` never adds special tokens when tokenizing
+ # Tokenizer's `apply_chat_template` adds special tokens when tokenizing
# Tokenizer's `apply_chat_template` never adds special tokens when tokenizing
# But processor's `apply_chat_template` didn't have an option to tokenize, so users had to format the prompt
# and pass it to the processor. Users thus never worried about special tokens relying on processor hadnling
Suggested change:
- # and pass it to the processor. Users thus never worried about special tokens relying on processor hadnling
+ # and pass it to the processor. Users thus never worried about special tokens relying on processor handling
Actually, users had to be careful to use `add_special_tokens=False` when tokenizing the rendered template. Is this not the case for some VLMs?
They had to, but we never added that in the docs. For example, in llava we didn't add `bos` in the template, but idefics copied llava and added `bos`, thus causing the problem of two special tokens.
Probably we need to make this quirk visible in the model docs, like a small comment in the demo inference code, for example.
I think the standard is for chat templates to handle special tokens, so idefics is doing it "right" and llava is an exception (that we have to provide BC for, of course). Perhaps we could default to the expected behaviour in our reference code, and treat exceptional cases based on some other property?
yeah, I think we'll need to:
- Update demo inference code when `tokenize=False`, and let users know why we pass `add_special_tokens=False` (see the sketch below)
- Slowly see if updating templates in llava doesn't increase comments from angry users, like after >5 minor releases. It might well be a bad idea until we make `tokenize=True` a go-to default for everyone; most users are stuck with an old version of transformers
- For any new model, make sure the `bos/eos` are in the template and not hacked from within processing code!
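A rough sketch of what such a demo snippet could look like (checkpoint, image, and prompt are placeholders, not from the PR):

# Hypothetical demo for the `tokenize=False` path: render the prompt with the chat template,
# then make sure tokenization doesn't add a second BOS when the template already emits one.
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")  # placeholder checkpoint
image = Image.new("RGB", (336, 336))  # placeholder image
messages = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What is in this image?"}]},
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Pass `add_special_tokens=False` only when the rendered prompt already starts with BOS,
# otherwise the input ends up with two BOS tokens.
already_has_bos = processor.tokenizer.bos_token is not None and prompt.startswith(processor.tokenizer.bos_token)
inputs = processor(text=prompt, images=image, add_special_tokens=not already_has_bos, return_tensors="pt")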
if self.tokenizer.bos_token is not None and single_prompt.startswith(self.tokenizer.bos_token):
    kwargs["add_special_tokens"] = False
❤️! Yes, I think this is the right solution - it's a little hacky, but it should cover all the cases I can think of correctly.
Cool, I will merge it then. Core modeling is not touched and the code owners' review should be enough.
Looking forward to this landing on a release so we can show one-step processor snippets for chat VLMs! 🔥
What does this PR do?
Prerequisite: we need support for images as batched lists for all VLMs, which is now done thanks to Yoni 💛
This PR adds:
- `add_special_tokens` as per @Rocketknight1's request
- `chat_template` in processor, not a dummy one
- `video_fps`, which is claimed to be better than sampling N frames uniformly, especially for longer videos. NOTE: that will result in videos of different `num_frames` and can lead to errors when batching, so it should be used only when `bs=1`. We don't support videos of different frame counts, yet (see the usage sketch below)
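For context, a sketch of the one-step flow this enables (checkpoint, video path, and the exact content keys are illustrative; `video_fps` and `num_frames` are the mutually exclusive sampling options discussed above):

# Illustrative one-step processor usage; names and paths are placeholders.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")  # placeholder checkpoint
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "my_video.mp4"},  # placeholder video; content keys may differ per model
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Sample by rate: better for long videos, but frame counts differ per video, so keep bs=1.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, video_fps=1
)

# Or sample a fixed number of frames uniformly; passing both kwargs raises a ValueError.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, num_frames=8
)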