🕵️♂️ Agent training #4300
Conversation
…_thw` in GRPO and RLOO trainers; update `split_pixel_values_by_grid` to use `image_grid_thw`
For information:
from transformers import AutoTokenizer
from trl.chat_template_utils import is_chat_template_prefix_preserving
tokenizers = [
    "trl-internal-testing/tiny-CohereForCausalLM",
    "trl-internal-testing/tiny-DbrxForCausalLM",
    "trl-internal-testing/tiny-DeepseekV3ForCausalLM",
    "trl-internal-testing/tiny-DeepseekV3ForCausalLM-0528",
    "trl-internal-testing/tiny-FalconMambaForCausalLM",
    "trl-internal-testing/tiny-Gemma2ForCausalLM",
    "trl-internal-testing/tiny-GemmaForCausalLM",
    "trl-internal-testing/tiny-GptOssForCausalLM",
    "trl-internal-testing/tiny-LlamaForCausalLM-3.1",
    "trl-internal-testing/tiny-LlamaForCausalLM-3.2",
    "trl-internal-testing/tiny-LlamaForCausalLM-3",
    "trl-internal-testing/tiny-MistralForCausalLM-0.1",
    "trl-internal-testing/tiny-MistralForCausalLM-0.2",
    "trl-internal-testing/tiny-Phi3ForCausalLM",
    "trl-internal-testing/tiny-Qwen2ForCausalLM-2.5",
    "trl-internal-testing/tiny-Qwen3ForCausalLM",
]
print("| Tokenizer | Tool Calls | Prefix-Preserving |")
print("| --- | --- | --- |")
for tokenizer_name in tokenizers:
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    tool_support = "✅" if "tool_call" in tokenizer.chat_template else "❌"
    prefix_preserving = "✅" if is_chat_template_prefix_preserving(tokenizer) else "❌"
    print(f"| {tokenizer_name} | {tool_support} | {prefix_preserving} |")
trl/trainer/grpo_trainer.py
# Extract tool calls from the completions
if self.tools:
    tool_calls = [completion[0].get("tool_calls") for completion in completions]
If it is not too complicated then L1510-1625 should go in their own method called _run_tool_calls or similar
This block can then be:
if self.tools:
tool_mask, completions, ... = self._run_tool_calls(...)
else:
tool_mask = None
It would make it simpler to parse the codebase when you are not using tools. It may make the tool call block more testable? There is a lot of stuff going on here; overall I understand the logic, but there may be edge cases that slip through. Is there a way we can break up the while loop into smaller methods and unit test them?
I agree with Ed that it would be great if we can refactor this somewhat to isolate the tool-calling logic and enable it to be tested independently
Can this be made independent of the GRPO trainer, so that other online trainers could potentially benefit from this helper?
I refactored to isolate the _tool_call_loop. 94c2ff2. Now we have, as @edbeeching suggested:
if self.tools:
    tool_mask, ... = self._tool_call_loop(prompts, prompt_ids, completion_ids, completions, logprobs)
else:
    tool_mask = None

Much easier to read indeed.
It may make the tool call block more testable? There is a lot of stuff going on here
enable it to be tested independently
I get the motivation, and I agree that "there is a lot of stuff going on here." My take is that it would be better to split this into smaller helper functions or utilities. That makes the code easier to follow, easier to test, and more flexible overall. I'll try to do it.
However, I usually avoid testing private methods or internal logic directly, because those tests can break easily for the wrong reasons: any small change inside can force a test rewrite. I prefer testing through the public API, even if it means patching a bit to make sure the right part runs. Like here
trl/tests/test_grpo_trainer.py
Lines 1721 to 1797 in 94c2ff2
def test_training_with_tools(self):
    # In this test, we define a simple tool that multiplies two integers. Regardless of the input prompt,
    # the model will generate 3 completions, 2 of which will be valid tool calls. Among the 2 tool calls, one will
    # succeed and the other will fail (because of a wrong argument name).
    def multiply(a: int, b: int) -> int:
        """
        Multiplies two integers.

        Args:
            a: The first integer.
            b: The second integer.

        Returns:
            The product of the two integers.
        """
        return a * b

    dataset = load_dataset("trl-internal-testing/zen", "conversational_prompt_only", split="train")
    training_args = GRPOConfig(
        output_dir=self.tmp_dir,
        learning_rate=0.1,
        per_device_train_batch_size=3,
        num_generations=3,
        max_completion_length=128,
        report_to="none",
    )
    trainer = GRPOTrainer(
        model="trl-internal-testing/tiny-Qwen3MoeForCausalLM",
        reward_funcs="trl-internal-testing/tiny-Qwen2ForSequenceClassification-2.5",
        args=training_args,
        train_dataset=dataset,
        tools=[multiply],
    )
    previous_trainable_params = {n: param.clone() for n, param in trainer.model.named_parameters()}

    def fake_generate(input_ids, **kwargs):
        if input_ids.shape[0] == 3:  # first call
            # fmt: off
            completion_ids = torch.tensor(
                [
                    # '<tool_call>\n{"name": "multiply", "arguments": {"a": 3, "b": 4}}\n</tool_call><|im_end|>'
                    [151657, 198, 4913, 606, 788, 330, 64648, 497, 330, 16370, 788, 5212, 64, 788, 220, 18, 11, 330, 65, 788, 220, 19, 11248, 151658, 151645],
                    # '<tool_call>\n{"name": "multiply", "arguments": {"a": 3, "c": 4}}\n</tool_call><|im_end|>'
                    [151657, 198, 4913, 606, 788, 330, 64648, 497, 330, 16370, 788, 5212, 64, 788, 220, 18, 11, 330, 66, 788, 220, 19, 11248, 151658, 151645],
                    # "I don't know any tool<|im_end|>"
                    [40, 1513, 944, 1414, 894, 5392, 151645, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643],
                ],
                device=input_ids.device,
            )
            # fmt: on
        else:  # second call will only have two inputs in the batch, because two examples have a tool call.
            completion_ids = torch.tensor(
                [
                    # 'Done!<|im_end|>'
                    [17453, 0, 151645],
                    # 'Done!<|im_end|>'
                    [17453, 0, 151645],
                ],
                device=input_ids.device,
            )
        return torch.cat([input_ids, completion_ids], dim=-1)

    with patch.object(trainer.model, "generate", side_effect=fake_generate):
        trainer.train()

    assert trainer.state.log_history[-1]["train_loss"] is not None
    assert trainer.state.log_history[-1]["tools/call_frequency"] is not None
    assert trainer.state.log_history[-1]["tools/call_frequency"] == pytest.approx(2 / 3)
    assert trainer.state.log_history[-1]["tools/failure_frequency"] is not None
    assert trainer.state.log_history[-1]["tools/failure_frequency"] == pytest.approx(1 / 2)

    # Check that the params have changed
    for n, param in previous_trainable_params.items():
        new_param = trainer.model.get_parameter(n)
        assert not torch.equal(param, new_param), f"Parameter {n} has not changed."
It is more difficult than it seems to find the right balance. 😅 I'd be curious to know what @albertvillanova thinks about it (when he is back).
can this be made independent of GRPO trainer as other online trainers could then potentially benefit from this helper?
It’s quite complex because a lot of the internal logic of this new method depends on other methods and attributes (self.chat_template_kwargs, self._tool_dict, self.processing_class, self._generate_single_turn, ...). In this case, I would lean more towards copying. From experience, even though this increases the amount of duplicated code in the codebase, it doesn’t really make maintenance harder; in fact, it often makes maintenance much easier. For example, RLOO and GRPO have a lot of duplicated code, but they are very easy to maintain together with very little shared abstraction.
agg_prompt_lengths = self.accelerator.gather(prompt_lengths)
agg_completion_lengths = self.accelerator.gather(completion_lengths)
total_prompt_tokens = agg_prompt_lengths.sum()
total_completion_tokens = agg_completion_lengths.sum()  # = num_items_in_batch, required for the DAPO loss
agg_completion_lengths.sum() # = num_items_in_batch
Does / should this include the tool call tokens?
It does include the tool call tokens, but shouldn't. I'll need to fix it. Excellent catch, thanks!
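For illustration, a self-contained sketch of one way to exclude tool-result tokens from the count, assuming a `tool_mask` with the same shape as `completion_mask` (names follow this PR, but the actual fix may differ):

```python
import torch

completion_mask = torch.tensor([[1, 1, 1, 1, 1, 0], [1, 1, 1, 0, 0, 0]])
tool_mask = torch.tensor([[1, 1, 0, 0, 1, 1], [1, 1, 1, 1, 1, 1]])  # 0 = token comes from a tool result

# Count only model-generated tokens toward num_items_in_batch.
completion_lengths = (completion_mask * tool_mask).sum(dim=1)
total_completion_tokens = completion_lengths.sum()  # excludes the 2 tool-result tokens: 3 + 3 = 6
print(total_completion_tokens)
```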
Thanks for this implementation. In general it is great, the only issue I have is that there is a lot of new code in L1510-1625 of the grpo_trainer.py file. It would be great to extract this into at least one method, or several methods for some of the blocks of code:
For example make a _run_tool_calls method which contains:
...
while idxs_with_tool:
...
# Call the tools, and build the new prompt for generation
... = call_the_tools(..)
# Tokenize and filter samples whose length exceeds max allowed length.
... = tokenize_and_filter(...)
# Generate new completions after tool execution
prompt_completion_tool_ids, post_tool_ids, post_tool_logprobs, _ = self._generate_single_turn(prompt_completion_tools)
# Sanity check: from experience
... etc
Then add tests for those methods; there seem to be a lot of edge cases you have already covered, and capturing those in tests would make this more robust.
The level of granularity you want with these methods is up to you of course. But at least one method for _run_tool_calls(...) would make the _generate method easier to parse, particularly for new users who are trying to understand the codebase and are not training LLMs to use tools (yet!).
lewtun left a comment:
Epic tour de force @qgallouedec 🔥 !
Overall LGTM, with some questions around the logic for truncating long multi-step rollouts in terms of max_model_len instead of max_completion_length.
# Ensure distributed rendezvous variables are set without colliding across concurrent runs
ensure_master_addr_port()

if self.max_prompt_length is not None and self.max_completion_length is not None:
Does this mean we now use the model's default max_model_len?
If so, would it make sense to expose max_model_len as an arg in the config to accommodate cases where the sequences one is training on are much smaller than the default value? This allows one to get better throughput / lower memory usage.
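For illustration, a sketch of what exposing it could look like; the field name `max_model_len` here is hypothetical, not an existing GRPOConfig argument:

```python
from dataclasses import dataclass, field
from typing import Optional

from trl import GRPOConfig


@dataclass
class GRPOConfigWithMaxModelLen(GRPOConfig):
    # Hypothetical extra argument: cap the generation backend's sequence length to save memory
    # when training sequences are much shorter than the model's default context.
    max_model_len: Optional[int] = field(
        default=None,
        metadata={"help": "If set, overrides the model's default max sequence length in the generation backend."},
    )
```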
# Tokenize and filter samples whose length exceeds max allowed length. This is important, because if vLLM
# is called with an input longer than its max model length, it will error out.
pct_ids = self.processing_class.apply_chat_template(prompt_completion_tools, **kwargs)["input_ids"]
overlong = [len(pct) - len(p) >= self.max_completion_length for p, pct in zip(p_ids, pct_ids, strict=True)]
Should this refer instead to max_model_len because technically one could have a small max_completion_length but a long context that is built up over many steps of turn calling.
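A toy illustration of the alternative check being suggested, assuming a `max_model_len` value is available (variable names follow the quoted snippet):

```python
# Filter on the *total* sequence length against the model's context window,
# rather than on the completion budget alone.
max_model_len = 32768  # illustrative value, e.g. taken from the vLLM engine config
pct_ids = [[101] * 31_000, [101] * 40_000]  # tokenized prompt+completion+tool sequences

overlong = [len(pct) >= max_model_len for pct in pct_ids]
print(overlong)  # [False, True]
```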
for idx in range(len(idxs_with_tool)):
    idx_with_tool = idxs_with_tool[idx]
    pct = prompt_completion_tool_ids[idx]  # = prompt-completion-tool
    assert prompt_ids[idx_with_tool] == pct[: len(prompt_ids[idx_with_tool])]
Should we raise a proper ValueError here with an informative error message for the user?
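Something along these lines, as a fragment meant to slot into the quoted loop (the message wording is only illustrative):

```python
# Sketch: replace the bare assert with an explicit, user-facing error.
if prompt_ids[idx_with_tool] != pct[: len(prompt_ids[idx_with_tool])]:
    raise ValueError(
        "Re-tokenizing the conversation after a tool call changed the prompt tokens, which suggests "
        "the chat template is not prefix-preserving. Tool calling requires a prefix-preserving chat template."
    )
```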
pct = prompt_completion_tool_ids[idx]  # = prompt-completion-tool
assert prompt_ids[idx_with_tool] == pct[: len(prompt_ids[idx_with_tool])]

# Truncate so that pct[len(prompt_ids[idx]) :] + post_tool does not exceed max_completion_length
Same comment here, that I don't think the issue is about exceeding the max_completion_length, rather that we should left-truncate to ensure we have enough tokens to avoid exceeding the model's context length (set via max_model_len)
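As a rough fragment of what that left-truncation could look like (all names here are illustrative, not the trainer's actual variables):

```python
# Keep only the most recent tokens so that the running context plus the remaining
# completion budget still fits within the model's context window.
budget = max_model_len - remaining_completion_budget
if len(pct) > budget:
    pct = pct[-budget:]
```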
trl/trainer/grpo_trainer.py
# Extract tool calls from the completions
if self.tools:
    tool_calls = [completion[0].get("tool_calls") for completion in completions]
I agree with Ed that it would be great if we can refactor this somewhat to isolate the tool-calling logic and enable it to be tested independently
return tokenizer
raise ValueError(
    "Unrecognized chat template, failed to add response schema. Please manually set the response schema on the "
    "tokenizer or processor."
To help users, shall we add a pointer to the docs?
| "tokenizer or processor." | |
| "tokenizer or processor. See the Transformers [docs](https://huggingface.co/docs/transformers/main/en/chat_response_parsing#response-parsing) for more details on response parsing." |
Added agent training example script (trackio)!
… calculation to exclude tool tokens
@codex review
Just a minor change to fix #4609.
After the merge of huggingface/transformers#40936, the attribute does not necessarily exist. Before it was None.
Feel free to ignore it if there is a better solution!
# At the time of initial implementation, most tokenizers do not have built-in support for response schemas.
# While waiting for broader adoption, we provide this utility function to manually set the response schema for
# known chat templates.
if tools and not processing_class.response_schema:
To fix #4609:
-    if tools and not processing_class.response_schema:
+    if tools and not getattr(processing_class, "response_schema", None):
guess = completion[-1]["content"].strip()
guess = completion[-1]["content"].strip().lower()
guess_clean = guess.replace("*", "").replace("`", "").strip()
reward = 0.0
Could L75-82 be simplified to:
if guess_clean == ans.lower():
    reward = 0.5
else:
    reward = -0.2
examples/scripts/grpo_agent.py
| if "error" in turn["content"].lower(): | ||
| reward -= 0.3 # penalize errors | ||
|
|
||
| if tool_called and tool_response_ok: |
107-112 would be easier to parse for a reader like this:
if tool_called:
    if tool_response_ok:
        reward += 0.25
    else:
        reward -= 0.2
else:
    reward -= 0.3

What does this PR do?
This PR implements tool calling for GRPO. The API is as follows:
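The code snippet from the original description is not reproduced above, so here is a minimal sketch reconstructed from the test added in this PR (`test_training_with_tools`); the `tools=[...]` argument is the new API surface, while the model, dataset, and reward function are illustrative:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer


def multiply(a: int, b: int) -> int:
    """
    Multiplies two integers.

    Args:
        a: The first integer.
        b: The second integer.

    Returns:
        The product of the two integers.
    """
    return a * b


def reward_tool_usage(completions, **kwargs):
    # Toy reward for illustration: favor rollouts that contain more than one assistant turn,
    # i.e. completions in which a tool was actually called.
    return [float(len(completion) > 1) for completion in completions]


dataset = load_dataset("trl-internal-testing/zen", "conversational_prompt_only", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    reward_funcs=reward_tool_usage,
    args=GRPOConfig(output_dir="grpo-agent", max_completion_length=128),
    train_dataset=dataset,
    tools=[multiply],  # new in this PR: pass Python callables as tools
)
trainer.train()
```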
This PR contains a few important changes:
🚨 Removal of `max_prompt_length`
This PR contains a breaking change: `max_prompt_length` has been removed from GRPO. Here are the reasons (tl;dr: because it’s extremely hard to implement reliably with multi-turn tool calling, likely harmful to training anyway, likely not used in practice, and dropping it simplifies the API while keeping it consistent across LLMs and VLMs):
Supporting `max_prompt_length` with tool calling is extremely complex.
For single-turn generation it works fine, but multi-turn generation introduces a major challenge: the prompt grows after every step. Since the model is called repeatedly with an increasingly long prompt, we would need to recalculate the allowed prompt length dynamically based on how many tokens have already been generated. Implementing this reliably is tricky and adds significant complexity.
Truncating prompts is likely worse than dropping samples altogether.
Although I’m not aware of formal studies, intuition suggests that truncation can remove information necessary to solve the task. Training on such incomplete examples can lead to strong biases, whereas simply skipping overly long samples avoids this risk.
It simplifies the API and removes confusing edge cases.
Previously, when training VLMs, we had to tell users to disable prompt truncation entirely because Transformers does not support truncating multimodal prompts. This led to inconsistent, non-user-friendly recommendations. Removing `max_prompt_length` allows us to provide one clean, unified API that works for all model types.
It is very likely not a widely used feature anyway.
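For users who relied on `max_prompt_length`, one possible workaround (not part of this PR, just a hedged sketch) is to filter overly long prompts out of the dataset before training:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
MAX_PROMPT_TOKENS = 512  # illustrative budget


def is_short_enough(example):
    # Conversational prompts: render via the chat template and count tokens.
    input_ids = tokenizer.apply_chat_template(example["prompt"], add_generation_prompt=True)
    return len(input_ids) <= MAX_PROMPT_TOKENS


dataset = load_dataset("trl-internal-testing/zen", "conversational_prompt_only", split="train")
dataset = dataset.filter(is_short_enough)
```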
Online decoding
Before calling the reward function, we need to decode the completion. Previously, this was done here:
trl/trl/trainer/grpo_trainer.py
Lines 1605 to 1617 in 1a9ff52
The issue is that, while this works for single-turn outputs, it does not allow reliable parsing of multi-turn text. See this internal discussion. The workaround is to parse after each turn, which requires moving the decoding logic inside the generation loop (in `_generate`):
trl/trl/trainer/grpo_trainer.py
Lines 1483 to 1495 in c54bf4f
trl/trl/trainer/grpo_trainer.py
Line 1543 in c54bf4f
trl/trl/trainer/grpo_trainer.py
Lines 1614 to 1618 in c54bf4f
The method then returns the list of messages:
trl/trl/trainer/grpo_trainer.py
Line 1669 in c54bf4f
Note that this change removes support for the "bootstrap" feature. I haven’t had time to investigate adding support for it.
Tool mask
We don't want the loss to be computed on the tokens corresponding to the tool result. Consequently, `_generate` builds and returns a `tool_mask`:
trl/trl/trainer/grpo_trainer.py
Line 1668 in c54bf4f
which is then used to mask these tokens in the loss computation.
trl/trl/trainer/grpo_trainer.py
Line 2100 in c54bf4f
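Conceptually, the masking amounts to something like the following self-contained sketch (simplified shapes and objective; not the exact lines from the trainer):

```python
import torch

# Toy batch: 2 completions of 5 tokens each.
per_token_logps = torch.randn(2, 5)
advantages = torch.tensor([0.7, -0.3])
completion_mask = torch.tensor([[1, 1, 1, 1, 0], [1, 1, 1, 1, 1]])
tool_mask = torch.tensor([[1, 1, 0, 0, 1], [1, 1, 1, 1, 1]])  # 0 = token produced by a tool result

# Fold the tool mask into the completion mask so tool-result tokens contribute neither
# to the loss nor to the normalization count.
completion_mask = completion_mask * tool_mask

per_token_loss = -advantages.unsqueeze(1) * per_token_logps  # simplified GRPO objective
loss = (per_token_loss * completion_mask).sum() / completion_mask.sum().clamp(min=1.0)
print(loss)
```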
Schema and fixed chat template
Chat template
To make this feature work, we need the chat template to be prefix-preserving, i.e.:
trl/trl/chat_template_utils.py
Lines 195 to 212 in 9f0aa3d
The issue is that some widely used tokenizers, such as GPT-OSS and Qwen3, are not prefix-preserving due to the way they handle think tokens. To address this, I suggest using a slightly modified version of the template that ensures it is prefix-preserving. Additionally, as @lewtun pointed out, it’s not even clear whether these templates might make the inference OOD
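As a rough illustration of the property being checked, here is a sketch of the idea behind `is_chat_template_prefix_preserving` (not its actual implementation):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("trl-internal-testing/tiny-Qwen3ForCausalLM")

messages = [{"role": "user", "content": "What is 3 * 4?"}]
extended = messages + [
    {"role": "assistant", "content": "12"},
    {"role": "user", "content": "And 5 * 6?"},
]

rendered_short = tokenizer.apply_chat_template(messages, tokenize=False)
rendered_long = tokenizer.apply_chat_template(extended, tokenize=False)

# Prefix-preserving means rendering more turns never rewrites what was already rendered,
# so previously generated token ids stay a valid prefix of the new prompt. Templates that
# strip or rewrite think tokens (e.g. Qwen3, GPT-OSS) can violate this.
print(rendered_long.startswith(rendered_short))
```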
Response schema
To parse tool calls from the model’s response, we rely on `tokenizer.parse_response`, introduced in huggingface/transformers#40894. This requires the tokenizer to have a `response_schema` (integrated in a similar way as chat templates). However, very few (no?) model repositories currently include such a schema. To enable this feature despite the lack of adoption, I propose adding a mapping from some popular chat templates to their response schemas (currently only Qwen3).
trl/trl/chat_template_utils.py
Lines 172 to 174 in fbb625f
Ideally, once adoption increases and model repos start including proper response schemas, we can remove this custom mapping entirely.
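For orientation, a hedged sketch of how response parsing would be used once a `response_schema` is available on the tokenizer (the exact output format follows transformers' response-parsing API, so details may differ):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

completion_text = '<tool_call>\n{"name": "multiply", "arguments": {"a": 3, "b": 4}}\n</tool_call>'

# Requires tokenizer.response_schema to be set, e.g. via the chat-template-to-schema
# mapping proposed in this PR; most Hub repos do not ship one yet.
parsed = tokenizer.parse_response(completion_text)
print(parsed)  # expected to expose the parsed tool call(s) if the schema matched
```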
A fair amount of complexity in the generation
This PR adds 60+ lines of intricate code with many special cases in the generation method. While it’s admittedly hard to follow, after a lot of iteration this is likely the simplest reliable way to implement the feature. Normally, I would be very reluctant to introduce this level of complexity, but given the impact of this feature, I believe it’s truly worth it.
trl/trl/trainer/grpo_trainer.py
Lines 1509 to 1623 in c54bf4f
Next steps