209 commits
552e899
Refactor image handling: replace `image_split_sizes` with `image_grid…
qgallouedec Sep 19, 2025
449ef07
simpler
qgallouedec Sep 19, 2025
c8933aa
gfpo
qgallouedec Sep 19, 2025
229c554
multi-image grpo
qgallouedec Sep 19, 2025
3ca6ad5
log with wandb
qgallouedec Sep 19, 2025
dcf4b92
no vlm reward models
qgallouedec Sep 20, 2025
30ad7ca
rloo
qgallouedec Sep 20, 2025
86cc30b
gfpo
qgallouedec Sep 20, 2025
088897b
fix
qgallouedec Sep 20, 2025
d2adc63
test peft
qgallouedec Sep 20, 2025
f4c82bf
fix gfpo
qgallouedec Sep 20, 2025
1257796
rloo test
qgallouedec Sep 20, 2025
099a39b
peft rloo
qgallouedec Sep 20, 2025
529add6
oops
qgallouedec Sep 20, 2025
fc6b11f
update test
qgallouedec Sep 20, 2025
ae1f497
generate method
qgallouedec Sep 20, 2025
f998432
debug
qgallouedec Sep 20, 2025
fa73876
skip failing test
qgallouedec Sep 20, 2025
52d8bd9
Merge branch 'main' into drop-image_split_sizes
qgallouedec Sep 20, 2025
dfc0d38
Merge branch 'drop-image_split_sizes' into multi-image-support
qgallouedec Sep 20, 2025
fc52e68
test fixed!
qgallouedec Sep 20, 2025
4d12aeb
Merge branch 'multi-image-support' into generate-method
qgallouedec Sep 20, 2025
4fc2b5b
gfpo
qgallouedec Sep 20, 2025
b628744
rm vllm
qgallouedec Sep 20, 2025
d3a769f
fix doc
qgallouedec Sep 20, 2025
e17ec42
Merge branch 'main' into drop-image_split_sizes
qgallouedec Sep 22, 2025
efbb03a
Merge branch 'drop-image_split_sizes' into multi-image-support
qgallouedec Sep 22, 2025
562c662
Merge branch 'main' into multi-image-support
qgallouedec Sep 22, 2025
485781c
Merge branch 'main' into multi-image-support
qgallouedec Sep 22, 2025
05270f8
update layers to ignore
qgallouedec Sep 22, 2025
1c53094
clarify image column desc
qgallouedec Sep 22, 2025
9b6652e
rm VLM x RM warning
qgallouedec Sep 23, 2025
c500440
Merge branch 'multi-image-support' into generate-method
qgallouedec Sep 23, 2025
a6a8c44
Merge branch 'main' into generate-method
qgallouedec Sep 23, 2025
d8665e1
Merge branch 'main' into generate-method
qgallouedec Sep 23, 2025
365d501
Merge branch 'main' into generate-method
qgallouedec Sep 23, 2025
cdb4c76
Merge branch 'main' into generate-method
qgallouedec Sep 24, 2025
c83e710
same for rloo
qgallouedec Sep 24, 2025
ec6ad25
nits style and align
qgallouedec Sep 24, 2025
b4cadde
Merge branch 'main' into generate-method
qgallouedec Sep 24, 2025
b0dceb9
restart
qgallouedec Sep 25, 2025
ebe32c2
progress
qgallouedec Sep 25, 2025
0213662
progress continues
qgallouedec Sep 25, 2025
8b3a724
progress again again
qgallouedec Sep 25, 2025
c1ae6aa
back to working point
qgallouedec Sep 25, 2025
1a66b43
revert chage data utils
qgallouedec Sep 25, 2025
2dc69a6
Merge branch 'main' into generate-method
qgallouedec Sep 26, 2025
9435a94
refactor in grpo
qgallouedec Sep 26, 2025
d3f1d3c
Merge branch 'main' into refactor_generate
qgallouedec Sep 26, 2025
3d8ea27
wrong merge commit
qgallouedec Sep 26, 2025
27dc958
fix num_input_tokens_seen
qgallouedec Sep 26, 2025
53772ef
getting closer
qgallouedec Sep 26, 2025
8766fa5
consistent naming
qgallouedec Sep 26, 2025
236b78b
better
qgallouedec Sep 26, 2025
9da4830
simplify a bit + comment
qgallouedec Sep 26, 2025
b3bd0b0
another one
qgallouedec Sep 26, 2025
d79b9e1
get prompt ids from generation
qgallouedec Sep 26, 2025
8d34d54
remove pad token removal
qgallouedec Sep 26, 2025
e770efe
Merge branch 'refactor_generate' into refactor_generate_2
qgallouedec Sep 26, 2025
0e2ae34
rely on generator for prompt truncation
qgallouedec Sep 26, 2025
46d8eb7
revert
qgallouedec Sep 26, 2025
11acc75
rm enforce eager
qgallouedec Sep 26, 2025
acee7d8
rm truncate_with_protected_tokens
qgallouedec Sep 26, 2025
0b5865e
ensure proper truncation and side
qgallouedec Sep 26, 2025
d8af003
rm useless comment
qgallouedec Sep 26, 2025
fc263a3
rm imports
qgallouedec Sep 26, 2025
35f99fd
requires padding
qgallouedec Sep 26, 2025
8149d05
rm truncation test
qgallouedec Sep 26, 2025
9925199
move forward_kwargs outside of generate
qgallouedec Sep 26, 2025
48a1c30
don't re-prepare data
qgallouedec Sep 26, 2025
15c6620
refactor: update prepare_multimodal_messages to accept images directl…
qgallouedec Sep 26, 2025
55a2480
rloo + doc
qgallouedec Sep 26, 2025
c8041e1
Merge branch 'refactor_generate' into refactor_generate_2
qgallouedec Sep 26, 2025
b8c0c9b
Merge branch 'refactor_generate_2' into refactor_generate_3
qgallouedec Sep 26, 2025
7b7a11d
test and doc
qgallouedec Sep 27, 2025
c5064d6
gfpo
qgallouedec Sep 27, 2025
effb41b
Merge branch 'main' into refactor_generate
qgallouedec Sep 27, 2025
e82bfb4
Merge branch 'main' into refactor_generate
qgallouedec Sep 27, 2025
4b9c126
Merge branch 'refactor_generate' into refactor_generate_2
qgallouedec Sep 27, 2025
3f02702
Merge branch 'refactor_generate_2' into refactor_generate_3
qgallouedec Sep 27, 2025
b0e0279
Merge branch 'refactor_generate_3' into refactor_generate_4
qgallouedec Sep 27, 2025
a01b9ca
Merge branch 'refactor_generate_4' into refactor_generate_5
qgallouedec Sep 27, 2025
6bc15a3
wip
qgallouedec Sep 28, 2025
f11759e
Merge branch 'main' into refactor_generate_2
qgallouedec Sep 30, 2025
e7aa945
fix vllm client server
qgallouedec Sep 30, 2025
e164ec5
repicate all_prompt_ids
qgallouedec Oct 1, 2025
49577ad
Same for RLOO
qgallouedec Oct 1, 2025
5fca5b8
fix normal generation path
qgallouedec Oct 1, 2025
5cc6af5
Merge branch 'refactor_generate_2' into refactor_generate_3
qgallouedec Oct 1, 2025
4dce145
remove vision tokens
qgallouedec Oct 1, 2025
ddfd3b5
same for rloo
qgallouedec Oct 1, 2025
c434fa2
truncation_side=left
qgallouedec Oct 1, 2025
377b081
rm test_training_vlm_and_prompt_truncation
qgallouedec Oct 1, 2025
d599c20
Merge branch 'main' into refactor_generate_2
qgallouedec Oct 1, 2025
e82db74
🔣 Fix test: replace `trainer.tokenizer` by `trainer.processing_class`…
qgallouedec Oct 1, 2025
192deb3
Fix CI ImportError: FlashAttention2 and decorator order for all param…
albertvillanova Oct 1, 2025
cf9d8e7
Hotfix wrong formatting of docstrings with blockquote tips (#4187)
albertvillanova Oct 1, 2025
f9c3c3c
🌡️ Have vLLM return processed (temperature scaled) log probs (#4163)
YonatanGideoni Oct 1, 2025
6489479
Replace remaining trainer.tokenizer with trainer.processing_class in …
albertvillanova Oct 3, 2025
21a67fc
[DOCS] Lora without regret (#4181)
burtenshaw Oct 3, 2025
c1e7ad2
[DOCS/FIX] lora without regrets - fix lr (#4207)
burtenshaw Oct 6, 2025
5d34144
Remove custome_container for building the docs (#4198)
albertvillanova Oct 6, 2025
ae2a0e7
Remove tokenizer creation from `sft` example script (#4197)
sergiopaniego Oct 6, 2025
6543f51
Hotfix: Exclude transformers 4.57.0 for Python 3.9 (#4209)
albertvillanova Oct 6, 2025
8319ce0
Replace unittest with pytest (#4188)
albertvillanova Oct 6, 2025
4fdaa4c
Updated vLLM integration guide (#4162)
sergiopaniego Oct 6, 2025
d258e36
Remove `Optional` from `processing_class` in `PPOTrainer` (#4212)
sergiopaniego Oct 6, 2025
7f5b499
Replace setup with pyproject and fix packaging unintended modules (#4…
albertvillanova Oct 6, 2025
df386f9
Merge branch 'main' into refactor_generate_2
qgallouedec Oct 6, 2025
5b9a6ab
Merge branch 'main' into refactor_generate_2
qgallouedec Oct 6, 2025
766bbce
Merge branch 'refactor_generate_2' into refactor_generate_3
qgallouedec Oct 6, 2025
ac2717f
Merge branch 'refactor_generate_3' into refactor_generate_4
qgallouedec Oct 6, 2025
4a274d5
Merge branch 'main' into refactor_generate_2
qgallouedec Oct 6, 2025
db552be
Merge branch 'refactor_generate_2' into refactor_generate_3
qgallouedec Oct 6, 2025
2c012dc
Merge branch 'refactor_generate_3' into refactor_generate_4
qgallouedec Oct 6, 2025
cb1d420
Merge branch 'refactor_generate_4' into refactor_generate_5
qgallouedec Oct 6, 2025
a84325c
style
qgallouedec Oct 6, 2025
34034e7
Merge branch 'refactor_generate_3' into refactor_generate_4
qgallouedec Oct 6, 2025
2ce6c1f
token_type_ids and RLOO
qgallouedec Oct 6, 2025
ddf3405
gfpo
qgallouedec Oct 6, 2025
e3c679c
style
qgallouedec Oct 6, 2025
ee03478
remove test case for prompt truncation
qgallouedec Oct 7, 2025
ed54e2a
Merge branch 'refactor_generate_3' into refactor_generate_4
qgallouedec Oct 7, 2025
5e4a026
Merge branch 'refactor_generate_4' into refactor_generate_5
qgallouedec Oct 7, 2025
45290c9
Merge branch 'main' into refactor_generate_3
qgallouedec Oct 7, 2025
a0ee1e6
Merge branch 'refactor_generate_3' into refactor_generate_4
qgallouedec Oct 7, 2025
f6e7c20
Merge branch 'refactor_generate_4' into refactor_generate_5
qgallouedec Oct 7, 2025
919ff5b
Merge branch 'main' into refactor_generate_5
qgallouedec Oct 17, 2025
fe11512
dedup and some fixes
qgallouedec Oct 18, 2025
c0c8807
fix style
qgallouedec Oct 18, 2025
ba8b938
rloo
qgallouedec Oct 18, 2025
7a2936e
style
qgallouedec Oct 18, 2025
1a6f040
test
qgallouedec Oct 18, 2025
b5c0078
Merge branch 'refactor_generate_5' into tool-call-finally
qgallouedec Oct 18, 2025
26ffb04
style
qgallouedec Oct 18, 2025
ced5450
safe prepare_multimodal_messages_vllm
qgallouedec Oct 18, 2025
23d13f9
oops
qgallouedec Oct 18, 2025
f98fe13
Merge branch 'refactor_generate_5' into tool-call-finally
qgallouedec Oct 18, 2025
5f87ee9
fix return-dict
qgallouedec Oct 18, 2025
89cff94
Merge branch 'refactor_generate_5' into tool-call-finally
qgallouedec Oct 18, 2025
0dac326
Merge branch 'main' into tool-call-finally
qgallouedec Oct 22, 2025
14afe75
Merge branch 'main' into tool-call-finally
qgallouedec Oct 28, 2025
ddcbbae
Merge branch 'main' into tool-call-finally
qgallouedec Oct 28, 2025
cb16cab
Merge branch 'main' into tool-call-finally
qgallouedec Oct 31, 2025
9102ba3
Merge branch 'main' into tool-call-finally
qgallouedec Nov 14, 2025
2d945f2
move extraction to util + doc
qgallouedec Nov 14, 2025
65ad930
using response parser
qgallouedec Nov 15, 2025
67e8f29
backward compat
qgallouedec Nov 15, 2025
a4eac3c
fixes
qgallouedec Nov 15, 2025
1e32b0a
don't truncate prompt
qgallouedec Nov 15, 2025
e816ef4
remove max_length
qgallouedec Nov 17, 2025
400bee4
move to chat template utils
qgallouedec Nov 17, 2025
b86483c
tool mask
qgallouedec Nov 17, 2025
93c7999
hard coded chat template
qgallouedec Nov 17, 2025
24ea4a4
almost done!!
qgallouedec Nov 18, 2025
5edee5c
Merge branch 'main' into tool-call-finally
qgallouedec Nov 18, 2025
9dfc511
fix chat template
qgallouedec Nov 18, 2025
2542320
just report error (not the traceback
qgallouedec Nov 18, 2025
1db53c1
style
qgallouedec Nov 18, 2025
f31996a
deprecate max_length + chat utils doc
qgallouedec Nov 18, 2025
6f2524d
test chat template utils
qgallouedec Nov 18, 2025
eb9eca9
test
qgallouedec Nov 19, 2025
19fa924
remove max_prompt_length
qgallouedec Nov 19, 2025
278703e
better doc
qgallouedec Nov 19, 2025
6828ba2
doc example and skip version below dev
qgallouedec Nov 19, 2025
ae653d8
fix overlong case
qgallouedec Nov 19, 2025
96387b3
test parse
qgallouedec Nov 19, 2025
714b9ea
example in the doc
qgallouedec Nov 19, 2025
3a1c7fb
comment in test
qgallouedec Nov 19, 2025
a1ebcba
version.parse -> Version
qgallouedec Nov 19, 2025
c340f52
comment chat template for vllm
qgallouedec Nov 19, 2025
d338c84
qol
qgallouedec Nov 19, 2025
f8444df
use chat template arg instead of ugly patch
qgallouedec Nov 19, 2025
6ac02e0
refactor: simplify response parsing in tokenizer and trainer
qgallouedec Nov 19, 2025
b8125bf
why it doesn't render well?
qgallouedec Nov 20, 2025
be255df
Merge branch 'main' into tool-call-finally
qgallouedec Nov 20, 2025
37d77ba
raw
qgallouedec Nov 20, 2025
a136592
style
qgallouedec Nov 20, 2025
e63a46c
fix: update xfail reason for tool parsing in TestParseResponse
qgallouedec Nov 20, 2025
d082309
revert rloo for now
qgallouedec Nov 20, 2025
0707baa
grpo with replay buffer
qgallouedec Nov 20, 2025
753d70d
jmespath dep
qgallouedec Nov 20, 2025
06414f2
is_jmespath_available
qgallouedec Nov 20, 2025
21792da
style
qgallouedec Nov 20, 2025
850a9eb
new section
qgallouedec Nov 20, 2025
438b586
ignore TestParseResponse for transformers<5
qgallouedec Nov 20, 2025
1c026ce
fix qwen schema
qgallouedec Nov 20, 2025
c54bf4f
another fix
qgallouedec Nov 20, 2025
9f0aa3d
remove unsused schemas
qgallouedec Nov 20, 2025
fbb625f
rename processor to tokenizer in add_response_schema function
qgallouedec Nov 20, 2025
ce6341b
deprecate max_prompt_length argument and add warning for future removal
qgallouedec Nov 20, 2025
493881f
Apply suggestions from code review
qgallouedec Nov 20, 2025
4d6a064
nit simplification
qgallouedec Nov 20, 2025
5a9bb20
Docs updated
sergiopaniego Nov 20, 2025
90a1ed1
Add monkey-patch for vLLM compatibility with TRL
qgallouedec Nov 20, 2025
a584e42
VLLM_LOGGING_LEVEL", "ERROR
qgallouedec Nov 20, 2025
fb4c694
Merge branch 'main' into tool-call-finally
qgallouedec Nov 21, 2025
aa2615a
Merge branch 'main' into tool-call-finally
qgallouedec Nov 23, 2025
c36ea41
Merge branch 'main' into tool-call-finally
qgallouedec Nov 25, 2025
caf1ad2
flip tool mask
qgallouedec Nov 25, 2025
94c2ff2
isolate tool call loop
qgallouedec Nov 25, 2025
3cbb28e
Add example script
sergiopaniego Nov 25, 2025
6074ade
code quality
sergiopaniego Nov 25, 2025
fc3d759
Update to more strict reward funcs
sergiopaniego Nov 25, 2025
e37508d
Update steps
sergiopaniego Nov 25, 2025
af749c1
Clarify token counting in reward metrics and adjust completion length…
qgallouedec Nov 25, 2025
988efc1
Updated example script with elaborated reward funcs
sergiopaniego Nov 27, 2025
ce7d607
Add example notebook and update docs
sergiopaniego Dec 1, 2025
6f65553
Merge branch 'main' into tool-call-finally
qgallouedec Dec 2, 2025
16 changes: 10 additions & 6 deletions docs/source/_toctree.yml
@@ -69,18 +69,22 @@
title: LoRA Without Regret
title: Examples
- sections:
- sections:
- local: chat_template_utils
title: Chat Template Utilities
- local: data_utils
title: Data Utilities
- local: model_utils
title: Model Utilities
- local: script_utils
title: Script Utilities
title: Utilities
- local: models
title: Model Classes
- local: model_utils
title: Model Utilities
- local: callbacks
title: Callbacks
- local: data_utils
title: Data Utilities
- local: rewards
title: Reward Functions
- local: script_utils
title: Script Utilities
- local: others
title: Others
title: API
17 changes: 17 additions & 0 deletions docs/source/chat_template_utils.md
@@ -0,0 +1,17 @@
# Chat template utilities

## add_response_schema

[[autodoc]] chat_template_utils.add_response_schema

## is_chat_template_prefix_preserving

[[autodoc]] chat_template_utils.is_chat_template_prefix_preserving

## get_training_chat_template

[[autodoc]] chat_template_utils.get_training_chat_template

## parse_response

[[autodoc]] chat_template_utils.parse_response
1 change: 1 addition & 0 deletions docs/source/example_overview.md
@@ -47,6 +47,7 @@ Scripts are maintained in the [`trl/scripts`](https://github.com/huggingface/trl
| [`examples/scripts/evals/judge_tldr.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/evals/judge_tldr.py) | This script shows how to use [`experimental.judges.HfPairwiseJudge`] or [`experimental.judges.OpenAIPairwiseJudge`] to judge model generations. |
| [`examples/scripts/gkd.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/gkd.py) | This script shows how to use the [`experimental.gkd.GKDTrainer`] to fine-tune a model. |
| [`trl/scripts/grpo.py`](https://github.com/huggingface/trl/blob/main/trl/scripts/grpo.py) | This script shows how to use the [`GRPOTrainer`] to fine-tune a model. |
| [`trl/scripts/grpo_agent.py`](https://github.com/huggingface/trl/blob/main/trl/scripts/grpo_agent.py) | This script shows how to use the [`GRPOTrainer`] to fine-tune a model to enable agentic usage. |
| [`examples/scripts/grpo_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/grpo_vlm.py) | This script shows how to use the [`GRPOTrainer`] to fine-tune a multimodal model for reasoning using the [lmms-lab/multimodal-open-r1-8k-verified](https://huggingface.co/datasets/lmms-lab/multimodal-open-r1-8k-verified) dataset. |
| [`examples/scripts/gspo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/gspo.py) | This script shows how to use GSPO via the [`GRPOTrainer`] to fine-tune model for reasoning using the [AI-MO/NuminaMath-TIR](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) dataset. |
| [`examples/scripts/gspo_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/gspo_vlm.py) | This script shows how to use GSPO via the [`GRPOTrainer`] to fine-tune a multimodal model for reasoning using the [lmms-lab/multimodal-open-r1-8k-verified](https://huggingface.co/datasets/lmms-lab/multimodal-open-r1-8k-verified) dataset. |
86 changes: 69 additions & 17 deletions docs/source/grpo_trainer.md
@@ -141,14 +141,14 @@ This constant is recommended to be the maximum completion length. To use this fo

While training and evaluating, we record the following reward metrics:

- `num_tokens`: The total number of tokens processed so far, including both prompts and completions.
- `num_tokens`: The total number of tokens processed so far, including both prompts and completions. When using tools, only non-tool tokens are counted.
- `step_time`: The average time (in seconds) taken per training step (including generation).
- `completions/mean_length`: The average length of generated completions.
- `completions/min_length`: The minimum length of generated completions.
- `completions/max_length`: The maximum length of generated completions.
- `completions/mean_terminated_length`: The average length of generated completions that terminate with EOS.
- `completions/min_terminated_length`: The minimum length of generated completions that terminate with EOS.
- `completions/max_terminated_length`: The maximum length of generated completions that terminate with EOS.
- `completions/mean_length`: The average length of generated completions. When using tools, only non-tool tokens are counted.
- `completions/min_length`: The minimum length of generated completions. When using tools, only non-tool tokens are counted.
- `completions/max_length`: The maximum length of generated completions. When using tools, only non-tool tokens are counted.
- `completions/mean_terminated_length`: The average length of generated completions that terminate with EOS. When using tools, only non-tool tokens are counted.
- `completions/min_terminated_length`: The minimum length of generated completions that terminate with EOS. When using tools, only non-tool tokens are counted.
- `completions/max_terminated_length`: The maximum length of generated completions that terminate with EOS. When using tools, only non-tool tokens are counted.
- `completions/clipped_ratio`: The ratio of truncated (clipped) completions.
- `reward/{reward_func_name}/mean`: The average reward from a specific reward function.
- `reward/{reward_func_name}/std`: The standard deviation of the reward from a specific reward function.
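The tool-token exclusion described above can be sketched in plain Python. This is a hypothetical illustration, not TRL's actual implementation: `tool_masks` marks each completion token as tool-produced (1) or model-produced (0), and `eos_flags` marks which completions terminated with EOS.

```python
def completion_length_stats(tool_masks, eos_flags):
    """Sketch of the length metrics above, counting only non-tool tokens.

    tool_masks: one list per completion; 1 = tool token, 0 = model token.
    eos_flags: one bool per completion; True if it terminated with EOS.
    """
    # Length of each completion, excluding tokens produced by tool calls
    lengths = [sum(1 for m in mask if m == 0) for mask in tool_masks]
    terminated = [length for length, eos in zip(lengths, eos_flags) if eos]
    stats = {
        "completions/mean_length": sum(lengths) / len(lengths),
        "completions/min_length": min(lengths),
        "completions/max_length": max(lengths),
    }
    if terminated:  # only defined when at least one completion hit EOS
        stats["completions/mean_terminated_length"] = sum(terminated) / len(terminated)
        stats["completions/min_terminated_length"] = min(terminated)
        stats["completions/max_terminated_length"] = max(terminated)
    return stats
```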
@@ -546,6 +546,68 @@ and the reward will be computed as the sum of the rewards from each function, or

Note that [`GRPOTrainer`] supports multiple reward functions of different types. See the parameters documentation for more details.

## Agent Training

GRPO supports **agent training** through the `tools` argument in [`GRPOTrainer`].
This parameter expects a list of Python functions that define the tools available to the agent:

```python
from trl import GRPOTrainer

trainer = GRPOTrainer(
tools=[tool1, tool2],
...,
)
```

Each tool must be a standard Python function with **type-hinted arguments and return types**, along with a **Google-style docstring** describing its purpose, arguments, and return value.
For more details, see the [Passing tools guide](https://huggingface.co/docs/transformers/en/chat_extras#passing-tools).

Example:

```python
from trl import GRPOTrainer

def multiply(a: int, b: int) -> int:
"""
Multiplies two integers.

Args:
a: The first integer.
b: The second integer.

Returns:
The product of the two integers.
"""
return a * b

trainer = GRPOTrainer(
tools=[multiply],
...,
)
```
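The type hints and docstring are what make the tool self-describing: they can be turned into a schema the model sees at generation time. A rough, self-contained sketch of that idea (this is illustrative only, not the actual transformers/TRL schema extraction):

```python
import inspect

def sketch_tool_schema(func):
    """Derive a minimal tool schema from a function's hints and docstring (sketch)."""
    doc = inspect.getdoc(func) or ""
    return {
        "name": func.__name__,
        "description": doc.split("\n")[0],  # first line of the docstring
        "parameters": {
            name: {"type": hint.__name__}
            for name, hint in func.__annotations__.items()
            if name != "return"  # the return annotation is not a parameter
        },
    }

def multiply(a: int, b: int) -> int:
    """Multiplies two integers."""
    return a * b

schema = sketch_tool_schema(multiply)
```

This is why the type hints and docstring are required: without them, there is nothing to build the schema from.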

### Supported Models

Tested with:

- **Qwen3** — e.g., `Qwen/Qwen3-0.6B`

> [!TIP]
> Compatibility with all LLMs is not guaranteed. If you believe a model should be supported, feel free to open an issue on GitHub — or better yet, submit a pull request with the required changes.

### Quick Start

Use [grpo\_agent.py](https://github.com/huggingface/trl/blob/main/examples/scripts/grpo_agent.py) to fine-tune an LLM for agentic workflows.

```bash
accelerate launch \
--config_file=examples/accelerate_configs/deepspeed_zero3.yaml \
examples/scripts/grpo_agent.py \
--model_name_or_path Qwen/Qwen3-0.6B
...
```

## Vision-Language Model (VLM) Training

GRPO supports training Vision-Language Models (VLMs) on multimodal datasets containing both text and images.
@@ -576,7 +638,6 @@ accelerate launch \
--learning_rate 1e-5 \
--gradient_checkpointing \
--dtype bfloat16 \
--max_prompt_length 2048 \
--max_completion_length 1024 \
--use_vllm \
--vllm_mode colocate \
@@ -587,15 +648,6 @@

### Configuration Tips

> [!TIP]
> For VLMs, truncating may remove image tokens, leading to errors during training. To avoid this, set `max_prompt_length=None` in the [`GRPOConfig`]. This allows the model to process the full sequence length without truncating image tokens.
>
> ```python
> GRPOConfig(max_prompt_length=None, ...)
> ```
>
> Only use `max_prompt_length` when you've verified that truncation won't remove image tokens for the entire dataset.

- Use LoRA on vision-language projection layers
- Enable 4-bit quantization to reduce memory usage
- VLMs are memory-intensive — start with smaller batch sizes
2 changes: 0 additions & 2 deletions docs/source/lora_without_regret.md
@@ -199,7 +199,6 @@ hf jobs uv run \
--warmup_ratio 0.0 \
--max_grad_norm 1.0 \
--beta 0.0 \
--max_prompt_length 1024 \
--max_completion_length 4096 \
--num_generations 16 \
--generation_batch_size 16 \
@@ -234,7 +233,6 @@ uv run "https://huggingface.co/datasets/burtenshaw/lora-without-regrets/resolve/
--warmup_ratio 0.0 \
--max_grad_norm 1.0 \
--beta 0.0 \
--max_prompt_length 1024 \
--max_completion_length 4096 \
--num_generations 16 \
--generation_batch_size 16 \
1 change: 0 additions & 1 deletion docs/source/paper_index.md
@@ -149,7 +149,6 @@ training_args = GRPOConfig(
loss_type="dr_grpo",
per_device_train_batch_size=1, # train_batch_size_per_device in the Training section of the repository
num_generations=8, # num_samples in the Training section of the repository
max_prompt_length=1024, # prompt_max_length in the Training section of the repository
max_completion_length=3000, # generate_max_length in the Training section of the repository
beta=0.0, # beta in the Training section of the repository
)
1 change: 0 additions & 1 deletion docs/source/rapidfire_integration.md
@@ -226,7 +226,6 @@ from rapidfireai.automl import RFGRPOConfig
training_args = RFGRPOConfig(
learning_rate=5e-6,
num_generations=8,
max_prompt_length=256,
max_completion_length=256,
# ... all other GRPOConfig parameters supported
)
1 change: 1 addition & 0 deletions examples/notebooks/README.md
@@ -4,6 +4,7 @@ This directory contains a collection of Jupyter notebooks that demonstrate how t

| Notebook | Description | Open in Colab |
| --- | --- | --- |
| [`grpo_agent.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/grpo_agent.ipynb) | GRPO for agent training | Not available due to OOM with Colab GPUs |
| [`openenv_wordle_grpo.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/openenv_wordle_grpo.ipynb) | GRPO to play Wordle on an OpenEnv environment | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/openenv_wordle_grpo.ipynb) |
| [`sft_trl_lora_qlora.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/sft_trl_lora_qlora.ipynb) | Supervised Fine-Tuning (SFT) using QLoRA on free Colab | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/sft_trl_lora_qlora.ipynb) |
| [`sft_qwen_vl.ipynb`](https://github.com/huggingface/trl/tree/main/examples/notebooks/sft_qwen_vl.ipynb) | Supervised Fine-Tuning (SFT) Qwen3-VL with QLoRA using TRL on free Colab | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/sft_qwen_vl.ipynb) |