
Add TRL GRPO Reasoning with Advanced Reward notebook #319


Open
behroozazarkhalili wants to merge 4 commits into main from add-grpo-advanced-reward-notebook

Conversation

@behroozazarkhalili behroozazarkhalili commented Jul 26, 2025

Summary

This notebook demonstrates advanced GRPO (Group Relative Policy Optimization) fine-tuning for mathematical reasoning using a comprehensive multi-reward training system on the GSM8K dataset.

Key Features

  • 4 Specialized Reward Functions: Format compliance, approximate matching, answer correctness, and number extraction
  • Memory Efficient Training: 4-bit quantization + LoRA for consumer GPUs
  • Interactive Experiment Tracking: Real-time training metrics with trackio dashboard
  • Structured Output Generation: Enforces step-by-step reasoning format with validation
  • Comprehensive Resource Management: GPU memory optimization and experiment cleanup
  • Production-Ready Code: Clean, well-documented, educational content
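The four reward functions listed above are not spelled out in this description. As a hedged illustration only, a format-compliance reward in TRL is typically a plain callable that scores each completion; the tag template below is an assumption, not necessarily the notebook's actual one.

```python
import re

# Hypothetical sketch of one of the four reward functions: format compliance.
# TRL's GRPOTrainer accepts plain callables that score a batch of completions;
# the <reasoning>/<answer> tag format here is an assumed template.
def format_reward(completions, **kwargs):
    """Return 1.0 per completion that matches <reasoning>...</reasoning><answer>...</answer>."""
    pattern = re.compile(r"<reasoning>.+?</reasoning>\s*<answer>.+?</answer>", re.DOTALL)
    return [1.0 if pattern.search(c) else 0.0 for c in completions]
```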

Technical Improvements

  • Streamlined content organization with concise, action-oriented instructions
  • Enhanced inline comments explaining technical decisions and implementation details
  • Optimized training parameters specifically for mathematical reasoning tasks
  • Comprehensive model evaluation with structured output validation
  • Timestamp-based unique run naming for experiment session separation
  • Proper logging configuration to suppress verbose HTTP request logs
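The last two items can be sketched in a few lines. This is an assumed shape (the run-name prefix and logger names are guesses, not the notebook's actual values): a timestamped run name keeps trackio sessions separate, and raising the HTTP client loggers to WARNING suppresses per-request logs.

```python
import logging
from datetime import datetime

# Timestamp-based run name so each training session gets a unique identifier
# (prefix "grpo-gsm8k" is an assumption, not the notebook's actual value).
run_name = f"grpo-gsm8k-{datetime.now():%Y%m%d-%H%M%S}"

# Quiet the per-request HTTP logs that these client libraries emit at INFO.
for noisy_logger in ("httpx", "urllib3"):
    logging.getLogger(noisy_logger).setLevel(logging.WARNING)
```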

Requirements Checklist

  • Notebook filename is lowercase
  • Added to _toctree.yml
  • Added to index.md
  • Author attribution included
  • Non-informative outputs removed
  • No empty code cells
  • No custom images requiring upload
  • All reviewer feedback addressed
  • Comprehensive cleanup and optimization completed
  • Branch rebased and conflicts resolved

Recent Updates

  • Complete notebook enhancement with comprehensive improvements
  • Added trackio experiment tracking with proper cleanup functionality
  • Streamlined all 38 cells for clarity and educational value
  • Enhanced resource management and GPU memory optimization
  • Improved model testing and evaluation sections
  • Removed decorative elements and simplified output formatting
  • Branch rebased onto latest main branch - ready for clean merge

This represents the final polished version ready for production use, incorporating all reviewer feedback and implementing best practices for educational content, technical accuracy, and resource management.

@merveenoyan @stevhliu @qgallouedec @sergiopaniego

Contributed by: Behrooz Azarkhalili

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

Member

@sergiopaniego sergiopaniego left a comment

Thanks for the addition! 😄
We already have a pretty similar example "Post training an LLM for reasoning with GRPO in TRL".
The idea of the repo is to have end-to-end recipes with extended explanations, so I'd suggest:

  • Extending the explanations throughout the recipe of the example.
  • Linking the previous example and making a clear distinction between them, explaining it at the beginning. Otherwise, it could lead to confusion for a reader looking for a GRPO example.

The recipes can be opened in Colab and possibly run, so it'd be nice to keep that in mind. For example, setting os.environ["CUDA_VISIBLE_DEVICES"] = "1" breaks on Colab, since there is only one GPU there.
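One way to address this comment, as a sketch rather than the PR's actual code, is to auto-detect the GPU count instead of hardcoding device "1", so the notebook still runs on Colab's single-GPU runtime:

```python
import os

# Colab-safe device setup: count the available GPUs instead of assuming a
# second one exists, and fall back gracefully when torch or a GPU is absent.
try:
    import torch
    n_gpus = torch.cuda.device_count()
except ImportError:  # torch unavailable, e.g. when previewing locally
    n_gpus = 0

if n_gpus > 0:
    # Use the first (and on Colab, only) visible GPU.
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```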

@@ -7,6 +7,7 @@ applications and solving various machine learning tasks using open-source tools

Check out the recently added notebooks:

- [TRL GRPO Reasoning with Advanced Reward](trl_grpo_reasoning_advanced_reward)
Member

You can remove the last entry, since we aim to keep only the latest 5 here.

Author

@behroozazarkhalili behroozazarkhalili Jul 29, 2025

Hi @sergiopaniego, I just added the notes you mentioned. I hope the extension and the differences between the two versions make sense now! 😊

@@ -0,0 +1,1452 @@
{
Member

@sergiopaniego sergiopaniego Aug 11, 2025

I'd suggest some possible ideas for improving this section

  • I'd reduce this section since it contains too much text. Instead, you can distribute the ideas where they fit better, for example by explaining the reward functions in the section where you introduce them.
  • I'd remove the section and subsections and only keep the title. If you want to add some relevant information, you can consider using bold style.
  • The comparison against the other example contains some problems. For example, the other example has two reward functions, but here you say it has only one. I'd suggest reviewing that.


@@ -0,0 +1,1452 @@
{
Member

@sergiopaniego sergiopaniego Aug 11, 2025

Code blocks include a lot of code without explanation. I'd suggest dividing them into meaningful subblocks and adding some explanation. Let's think about the target audience (a learner) :)



@@ -0,0 +1,1452 @@
{
Member

@sergiopaniego sergiopaniego Aug 11, 2025

We could link the dataset on the Hub here so the reader can explore it further.



@@ -0,0 +1,1452 @@
{
Member

@sergiopaniego sergiopaniego Aug 11, 2025

It'd be nice if we could reduce these blocks a little, since they contain a lot of details. Are all the hyperparameters needed?



@@ -0,0 +1,1452 @@
{
Member

@sergiopaniego sergiopaniego Aug 11, 2025

We should explain the decisions made throughout the notebook. Why do we need a callback (always think of a possible reader/learner)?



Member

@sergiopaniego sergiopaniego left a comment

Could you also resolve the conflicts with main? 😄

This notebook demonstrates how to use TRL (Transformers Reinforcement Learning)
with GRPO (Group Relative Policy Optimization) for reasoning tasks with
advanced reward mechanisms.

- Added notebook with proper lowercase filename
- Updated _toctree.yml and index.md
- Added proper author attribution
- Cleaned non-informative outputs

Contributed by: Behrooz Azarkhalili
- Remove torch and accelerate from installation (dependencies of TRL)
- Remove pad token check (handled automatically)
- Restore num_generations to default value (8)
- Remove remove_unused_columns parameter (false by default)
- Remove processing_class parameter (loaded automatically)
…O recipe

- Add direct link to existing HuggingFace GRPO cookbook example
- Fix CUDA device setting for Colab compatibility (auto-detect instead of hardcoded)
- Add comprehensive explanations throughout all recipe sections
- Enhance with detailed comparison table showing differences from basic example
- Improve GPU setup with memory information and fallback instructions
- Add detailed LoRA configuration explanations and parameter analysis
- Expand dataset preparation with GSM8K background and format details
- Detail multi-reward system design for mathematical reasoning approach
- Optimize training configuration with Colab-specific memory settings
- Enhance testing and evaluation with detailed response analysis
- Make notebook fully end-to-end recipe focused for cookbook standards
- Address all reviewer feedback comprehensively for cookbook contribution
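The "multi-reward system design" mentioned above depends on reliably pulling final numbers out of text. As a hedged sketch (the function name is an assumption), GSM8K ground-truth answers end with "#### <number>", so both the reference and the model completion can be scored by comparing their final numbers:

```python
import re

# Extract the last number from a GSM8K-style answer or a model completion.
# Commas are stripped first so "1,250" parses as 1250.
def extract_final_number(text):
    """Return the last number in `text` as a float, or None if none is found."""
    matches = re.findall(r"-?\d+\.?\d*", text.replace(",", ""))
    return float(matches[-1]) if matches else None
```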
…anup

Major improvements to GRPO mathematical reasoning notebook:

Content Organization:
- Streamlined introduction removing verbose explanations
- Simplified installation and setup sections with clear instructions
- Updated all markdown cells to be concise and action-oriented
- Improved inline comments to explain technical decisions and "why" behind code

Technical Enhancements:
- Added trackio experiment tracking with comprehensive configuration
- Implemented timestamp-based unique run naming for session separation
- Enhanced logging configuration to suppress verbose HTTP request logs
- Optimized training parameters for mathematical reasoning tasks
- Improved model evaluation section with structured output validation

Code Quality:
- Clean, consistent formatting across all 38 cells
- Removed decorative print statements and emojis from evaluation section
- Added proper error handling and documentation
- Streamlined resource management and GPU memory optimization

Resource Management:
- Added remove_trackio_project() function for database cleanup
- Comprehensive cleanup section with storage management
- Warning comments about permanent data deletion
- Proper resource freeing with GPU cache clearing

Testing and Validation:
- Enhanced model testing with optimized generation parameters
- Improved format compliance checking with detailed validation
- Better answer accuracy verification with extraction methods
- Comprehensive response analysis and debugging output

This represents the final polished version ready for production use,
incorporating all previous feedback and implementing best practices
for educational content, technical accuracy, and resource management.
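The resource-freeing step described in the commit message above can be sketched as follows (the function name is an assumption, not the notebook's actual helper): run Python's garbage collector first so dangling tensor references are dropped, then ask PyTorch to release its cached GPU memory.

```python
import gc

# Free CPU-side references first, then release PyTorch's cached GPU memory
# so a later run starts from a clean slate.
def free_gpu_memory():
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass  # torch not installed; nothing GPU-side to clear
```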
@behroozazarkhalili force-pushed the add-grpo-advanced-reward-notebook branch from e6e5cbb to 72a5d43 on August 24, 2025 at 01:39
Labels: none yet
Projects: none yet
3 participants