Skip to content

πŸ§‘β€πŸ³ Added Post training an VLM for reasoning with GRPO using TRL recipe #312

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

sergiopaniego
Copy link
Member

@sergiopaniego sergiopaniego commented Jul 10, 2025

What does this PR do?

Fixes #311

Copy link

Check out this pull request onΒ  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link

review-notebook-app bot commented Jul 23, 2025

View / edit / reply to this conversation on ReviewNB

ariG23498 commented on 2025-07-23T06:15:39Z
----------------------------------------------------------------

I think the title should be "Post training a VLM for reasoning with GRPO using TRL" instead of "an VLM"

"In this recipe, we'll demonstrate how to post-train a" instead of "post-training".

We should have "Vision Language Model (VLM)" spelled out right at the beginning somewhere and then use "VLM"


Copy link

review-notebook-app bot commented Jul 23, 2025

View / edit / reply to this conversation on ReviewNB

ariG23498 commented on 2025-07-23T06:15:40Z
----------------------------------------------------------------

Question: Do we still use the notebook_login method? I think login works here as well.


Copy link

review-notebook-app bot commented Jul 23, 2025

View / edit / reply to this conversation on ReviewNB

ariG23498 commented on 2025-07-23T06:15:41Z
----------------------------------------------------------------

Line #4.    processor = AutoProcessor.from_pretrained(model_id, use_fast=True, padding_side="left")

Is there a particular reason to put padding_side to left?


sergiopaniego commented on 2025-07-28T16:00:38Z
----------------------------------------------------------------

Yes!

It's needed so the generations during training are concatenated directly to the input. Otherwise, we could have [PAD] gaps between the input and the generation. I have added a line explaining that since it's relevant :) Thanks for pointing it out!!

Copy link

review-notebook-app bot commented Jul 23, 2025

View / edit / reply to this conversation on ReviewNB

ariG23498 commented on 2025-07-23T06:15:41Z
----------------------------------------------------------------

Line #19.                    {"type": "image"},

I think we should also add the image to this dictionary. Something like the following:

{"type": "image", "image": example["image"]}

Copy link

review-notebook-app bot commented Jul 23, 2025

View / edit / reply to this conversation on ReviewNB

ariG23498 commented on 2025-07-23T06:15:42Z
----------------------------------------------------------------

Line #11.        # Parameters that control de data preprocessing

de -> the


@sergiopaniego sergiopaniego marked this pull request as ready for review July 28, 2025 15:33
Copy link
Member Author

Yes!

It's needed so the generations during training are concatenated directly to the input. Otherwise, we could have [PAD] gaps between the input and the generation. I have added a line explaining that since it's relevant :) Thanks for pointing it out!!


View entire conversation on ReviewNB

@sergiopaniego
Copy link
Member Author

Ready for review!
Already addressed @ariG23498's comments.

We can see that the reward goes up below:
Screenshot 2025-07-28 at 18 02 11

@@ -0,0 +1,3392 @@
{
Copy link
Collaborator

@merveenoyan merveenoyan Jul 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this sentence is a bit inverted:

For our particular case where we want the model to learn to reason using images, we use as inputΒ imageΒ andΒ problemΒ and as outputΒ solutionΒ columns.


Reply via ReviewNB

@@ -0,0 +1,3392 @@
{
Copy link
Collaborator

@merveenoyan merveenoyan Jul 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: trainig on last sentence


Reply via ReviewNB

@@ -0,0 +1,3392 @@
{
Copy link
Collaborator

@merveenoyan merveenoyan Jul 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

traininig* in last sentence

also more explanation on these params would be nice, i.e. what hardware limitations, for reasoning what could be the most important etc


Reply via ReviewNB

@@ -0,0 +1,3392 @@
{
Copy link
Collaborator

@merveenoyan merveenoyan Jul 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in here we only see train loss and not reward, if we can't enable reward we could mention slightly as loss is a bit odd


Reply via ReviewNB

@sergiopaniego
Copy link
Member Author

Thanks a lot for the comments, super valuable for improvement ❀️!

Recipe improved based on feedback πŸ˜„

Copy link

@ariG23498 ariG23498 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once @merveenoyan's comments are addressed, it is okay to be merged.

This is a very nice recipe. Kudos on the work!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

πŸ§‘β€πŸ³ New Post training an VLM for reasoning with GRPO using TRL recipe
4 participants