🧑‍🍳 Added `Post training an VLM for reasoning with GRPO using TRL` recipe #312

sergiopaniego · 2025-07-10T16:11:04Z

What does this PR do?

Fixes #311

review-notebook-app · 2025-07-10T16:11:10Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

HuggingFaceDocBuilderDev · 2025-07-11T10:19:35Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

review-notebook-app · 2025-07-23T06:15:40Z

View / edit / reply to this conversation on ReviewNB

ariG23498 commented on 2025-07-23T06:15:39Z
----------------------------------------------------------------

I think the title should be "Post training a VLM for reasoning with GRPO using TRL" instead of "an VLM"

"In this recipe, we'll demonstrate how to post-train a" instead of "post-training".

We should have "Vision Language Model (VLM)" spelled out right at the beginning somewhere and then use "VLM"

review-notebook-app · 2025-07-23T06:15:40Z

View / edit / reply to this conversation on ReviewNB

ariG23498 commented on 2025-07-23T06:15:40Z
----------------------------------------------------------------

Question: Do we still use the notebook_login method? I think login works here as well.

review-notebook-app · 2025-07-23T06:15:41Z

View / edit / reply to this conversation on ReviewNB

ariG23498 commented on 2025-07-23T06:15:41Z
----------------------------------------------------------------

Line #4.    processor = AutoProcessor.from_pretrained(model_id, use_fast=True, padding_side="left")

Is there a particular reason to put padding_side to left?

sergiopaniego commented on 2025-07-28T16:00:38Z
----------------------------------------------------------------

Yes!

It's needed so the generations during training are concatenated directly to the input. Otherwise, we could have [PAD] gaps between the input and the generation. I have added a line explaining that since it's relevant :) Thanks for pointing it out!!

review-notebook-app · 2025-07-23T06:15:42Z

View / edit / reply to this conversation on ReviewNB

ariG23498 commented on 2025-07-23T06:15:41Z
----------------------------------------------------------------

Line #19.                    {"type": "image"},

I think we should also add the image to this dictionary. Something like the following:

{"type": "image", "image": example["image"]}

review-notebook-app · 2025-07-23T06:15:42Z

View / edit / reply to this conversation on ReviewNB

ariG23498 commented on 2025-07-23T06:15:42Z
----------------------------------------------------------------

Line #11.        # Parameters that control de data preprocessing

de -> the

…dd-vlm-grpo

sergiopaniego · 2025-07-28T16:00:39Z

Yes!

It's needed so the generations during training are concatenated directly to the input. Otherwise, we could have [PAD] gaps between the input and the generation. I have added a line explaining that since it's relevant :) Thanks for pointing it out!!

View entire conversation on ReviewNB

sergiopaniego · 2025-07-28T16:02:56Z

Ready for review!
Already addressed @ariG23498's comments.

We can see that the reward goes up below:

merveenoyan · 2025-07-30T12:35:21Z

notebooks/en/fine_tuning_vlm_grpo_trl.ipynb

@@ -0,0 +1,3392 @@
+{


this sentence is a bit inverted:
For our particular case where we want the model to learn to reason using images, we use as input image and problem and as output solution columns.

Reply via ReviewNB

merveenoyan · 2025-07-30T12:35:21Z

notebooks/en/fine_tuning_vlm_grpo_trl.ipynb

@@ -0,0 +1,3392 @@
+{


nit: trainig on last sentence

Reply via ReviewNB

merveenoyan · 2025-07-30T12:35:21Z

notebooks/en/fine_tuning_vlm_grpo_trl.ipynb

@@ -0,0 +1,3392 @@
+{


traininig* in last sentence

also more explanation on these params would be nice, i.e. what hardware limitations, for reasoning what could be the most important etc

Reply via ReviewNB

merveenoyan · 2025-07-30T12:35:21Z

notebooks/en/fine_tuning_vlm_grpo_trl.ipynb

@@ -0,0 +1,3392 @@
+{


in here we only see train loss and not reward, if we can't enable reward we could mention slightly as loss is a bit odd

Reply via ReviewNB

sergiopaniego · 2025-07-30T14:05:40Z

Thanks a lot for the comments, super valuable for improvement ❤️!

Recipe improved based on feedback 😄

ariG23498

Once @merveenoyan's comments are addressed, it is okay to be merged.

This is a very nice recipe. Kudos on the work!

[WIP] Added VLM+GRPO recipe

35f980d

sergiopaniego added 2 commits July 11, 2025 12:11

Added to toctree and index

8d84576

Removed output

bf74fc4

sergiopaniego added 2 commits July 11, 2025 17:54

Working example

be9ca38

Removed output

f0627ad

sergiopaniego mentioned this pull request Jul 11, 2025

👁️ [GRPO] Add VLM training capabilities to the trainer huggingface/trl#3072

Merged

5 tasks

sergiopaniego added 3 commits July 14, 2025 13:06

Recipe updated

c71afb1

Updated with updated trainer

908bd31

Using padding left

3ab1604

sergiopaniego added 3 commits July 28, 2025 17:30

:

c695d7e

Updated with latest trl and based on comments

d7dc5df

Merge branch 'main' of https://github.com/huggingface/cookbook into a…

ea01824

…dd-vlm-grpo

sergiopaniego marked this pull request as ready for review July 28, 2025 15:33

Explaining padding_side left

0789a31

merveenoyan reviewed Jul 30, 2025

View reviewed changes

Upgraded based on review

eda3e7c

ariG23498 approved these changes Jul 31, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

🧑‍🍳 Added `Post training an VLM for reasoning with GRPO using TRL` recipe #312

🧑‍🍳 Added `Post training an VLM for reasoning with GRPO using TRL` recipe #312

Uh oh!

sergiopaniego commented Jul 10, 2025 •

edited

Loading

Uh oh!

review-notebook-app bot commented Jul 10, 2025

Uh oh!

HuggingFaceDocBuilderDev commented Jul 11, 2025

Uh oh!

review-notebook-app bot commented Jul 23, 2025 •

edited

Loading

Uh oh!

review-notebook-app bot commented Jul 23, 2025 •

edited

Loading

Uh oh!

review-notebook-app bot commented Jul 23, 2025 •

edited

Loading

Uh oh!

review-notebook-app bot commented Jul 23, 2025 •

edited

Loading

Uh oh!

review-notebook-app bot commented Jul 23, 2025 •

edited

Loading

Uh oh!

sergiopaniego commented Jul 28, 2025

Uh oh!

sergiopaniego commented Jul 28, 2025

Uh oh!

merveenoyan Jul 30, 2025 •

edited

Loading

Uh oh!

merveenoyan Jul 30, 2025 •

edited

Loading

Uh oh!

merveenoyan Jul 30, 2025 •

edited

Loading

Uh oh!

merveenoyan Jul 30, 2025 •

edited

Loading

Uh oh!

sergiopaniego commented Jul 30, 2025

Uh oh!

ariG23498 left a comment

Uh oh!

Uh oh!

🧑‍🍳 Added Post training an VLM for reasoning with GRPO using TRL recipe #312

Are you sure you want to change the base?

🧑‍🍳 Added Post training an VLM for reasoning with GRPO using TRL recipe #312

Uh oh!

Conversation

sergiopaniego commented Jul 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Uh oh!

review-notebook-app bot commented Jul 10, 2025

Uh oh!

HuggingFaceDocBuilderDev commented Jul 11, 2025

Uh oh!

review-notebook-app bot commented Jul 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

review-notebook-app bot commented Jul 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

review-notebook-app bot commented Jul 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

review-notebook-app bot commented Jul 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

review-notebook-app bot commented Jul 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sergiopaniego commented Jul 28, 2025

Uh oh!

sergiopaniego commented Jul 28, 2025

Uh oh!

merveenoyan Jul 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

merveenoyan Jul 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

merveenoyan Jul 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

merveenoyan Jul 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sergiopaniego commented Jul 30, 2025

Uh oh!

ariG23498 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

🧑‍🍳 Added `Post training an VLM for reasoning with GRPO using TRL` recipe #312

🧑‍🍳 Added `Post training an VLM for reasoning with GRPO using TRL` recipe #312

sergiopaniego commented Jul 10, 2025 •

edited

Loading

review-notebook-app bot commented Jul 23, 2025 •

edited

Loading

review-notebook-app bot commented Jul 23, 2025 •

edited

Loading

review-notebook-app bot commented Jul 23, 2025 •

edited

Loading

review-notebook-app bot commented Jul 23, 2025 •

edited

Loading

review-notebook-app bot commented Jul 23, 2025 •

edited

Loading

merveenoyan Jul 30, 2025 •

edited

Loading

merveenoyan Jul 30, 2025 •

edited

Loading

merveenoyan Jul 30, 2025 •

edited

Loading

merveenoyan Jul 30, 2025 •

edited

Loading