25 commits
494657f
add 10 papers (+trainer cross-links) for #4407
SSusantAchary Nov 3, 2025
70a26e6
Add links to GRPOTrainer and CPOTrainer in paper index
SSusantAchary Nov 3, 2025
85054a4
Merge branch 'main' into Paper-Index-with-10-papers
SSusantAchary Nov 3, 2025
2e01916
Clean up paper_index.md
SSusantAchary Nov 3, 2025
4e2f10b
Add DeepSeekMath section with GRPO configuration
SSusantAchary Nov 3, 2025
0cf0649
Enhance paper index with code examples and updates
SSusantAchary Nov 3, 2025
be590d6
Merge branch 'main' into Paper-Index-with-10-papers
SSusantAchary Nov 3, 2025
5f2e5e9
Add PEFT section with LoRA implementation example
SSusantAchary Nov 3, 2025
597a397
Remove beta parameter from GRPOConfig
SSusantAchary Nov 6, 2025
f1d98b0
Refactor training_args by removing unused parameters
SSusantAchary Nov 6, 2025
9348298
Add Direct Policy Optimization section to paper index
SSusantAchary Nov 6, 2025
a9d9467
Change loss_type from 'sigmoid' to 'hinged'
SSusantAchary Nov 6, 2025
66cc83d
Revise RSO example with model loading and DPO config
SSusantAchary Nov 6, 2025
e35c02d
removed explicit defaults from DPOConfig (loss_type, beta) per review
SSusantAchary Nov 6, 2025
98d3086
simplify example; rely on defaults and keep only essential PEFT task_…
SSusantAchary Nov 6, 2025
627db69
Merge branch 'main' into Paper-Index-with-10-papers
SSusantAchary Nov 6, 2025
33f7f42
Merge branch 'main' into Paper-Index-with-10-papers
SSusantAchary Nov 7, 2025
1efce43
Merge branch 'main' into Paper-Index-with-10-papers
SSusantAchary Nov 7, 2025
e50ed53
Merge branch 'main' into Paper-Index-with-10-papers
SSusantAchary Nov 9, 2025
0856286
Merge branch 'main' into Paper-Index-with-10-papers
SSusantAchary Nov 11, 2025
e0e0e39
Merge branch 'main' into Paper-Index-with-10-papers
SSusantAchary Nov 14, 2025
f204d80
Merge branch 'main' into Paper-Index-with-10-papers
SSusantAchary Nov 15, 2025
a7872c3
Merge branch 'main' into Paper-Index-with-10-papers
SSusantAchary Nov 22, 2025
d428d41
Merge branch 'main' into Paper-Index-with-10-papers
SSusantAchary Nov 23, 2025
ab26bcc
Merge branch 'main' into Paper-Index-with-10-papers
SSusantAchary Nov 24, 2025
59 changes: 59 additions & 0 deletions docs/source/paper_index.md
@@ -231,6 +231,11 @@ trainer = PAPOTrainer(
...
)
```
### DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
**📜 Paper**: https://huggingface.co/papers/2402.03300

Introduces **GRPO** (Group Relative Policy Optimization) and shows strong math-reasoning gains from math-centric pretraining combined with a PPO-style objective that replaces the learned value function with a group-relative baseline over sampled completions.
**Used in TRL via:** [`GRPOTrainer`]
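
Below is a minimal sketch of how GRPO is typically set up with [`GRPOTrainer`]; the model name, dataset, and toy reward function are illustrative placeholders, not taken from the paper:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")  # any prompt-only dataset

# Toy reward: GRPO scores groups of sampled completions with one or more reward functions.
def reward_unique_chars(completions, **kwargs):
    return [len(set(completion)) for completion in completions]

training_args = GRPOConfig(output_dir="Qwen2-0.5B-GRPO")
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_unique_chars,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```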

## Direct Preference Optimization

@@ -457,6 +462,21 @@ training_args = DPOConfig(

These parameters only appear in the [published version](https://aclanthology.org/2025.tacl-1.22.pdf)

### Statistical Rejection Sampling Improves Preference Optimization
**📜 Paper**: https://huggingface.co/papers/2309.06657

Proposes **RSO**, selecting stronger preference pairs via statistical rejection sampling to boost offline preference optimization; complements DPO/SLiC.
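
A minimal sketch of the TRL connection: the hinge loss on normalized likelihoods proposed by the RSO authors can be selected in `DPOConfig` via `loss_type="hinge"` (model and dataset names below are placeholders):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# RSO's hinge loss on normalized likelihoods instead of the default sigmoid (DPO) loss.
training_args = DPOConfig(output_dir="Qwen2-0.5B-RSO", loss_type="hinge")
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```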

### Nash Learning from Human Feedback
**📜 Paper**: https://huggingface.co/papers/2312.00886

Frames alignment as a **two-player game** and learns a policy at the Nash equilibrium of a pairwise preference model rather than maximizing a pointwise reward; connects to multi-objective and competitive preference training.
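
A minimal sketch, assuming TRL's `NashMDTrainer`/`NashMDConfig` (the Nash-MD implementation) with an AI judge standing in for human pairwise feedback; the judge, model, and dataset names are placeholders:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import NashMDConfig, NashMDTrainer, PairRMJudge

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
judge = PairRMJudge()  # pairwise preference judge standing in for human feedback
dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

training_args = NashMDConfig(output_dir="Qwen2-0.5B-NashMD")
trainer = NashMDTrainer(
    model=model,
    judge=judge,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```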

### Direct Language Model Alignment from Online AI Feedback
**📜 Paper**: https://huggingface.co/papers/2402.04792

Uses **online AI feedback (OAIF)**: an LLM annotator labels preference pairs sampled from the current policy during training, improving direct alignment methods beyond purely offline preference data.
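
The setup mirrors the Nash-MD sketch above; a hedged sketch with `OnlineDPOTrainer`, assuming it accepts an AI judge that labels on-policy completion pairs during training (all names are placeholders):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import OnlineDPOConfig, OnlineDPOTrainer, PairRMJudge

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
judge = PairRMJudge()  # AI annotator providing online preference labels
dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

training_args = OnlineDPOConfig(output_dir="Qwen2-0.5B-OnlineDPO")
trainer = OnlineDPOTrainer(
    model=model,
    judge=judge,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```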

## Supervised Fine-Tuning

Papers relating to the [`SFTTrainer`]
@@ -509,6 +529,10 @@ SFTConfig(
gradient_accumulation_steps=32,
)
```
### LoRA: Low-Rank Adaptation of Large Language Models
**📜 Paper**: https://huggingface.co/papers/2106.09685

Parameter-efficient fine-tuning via **low-rank adapters**, cutting trainable parameters and memory while preserving quality (see PEFT integration in TRL).
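
A minimal sketch of LoRA fine-tuning through TRL's PEFT integration, assuming [`SFTTrainer`]'s `peft_config` argument; the model and dataset names are placeholders, and adapter hyperparameters are left at their PEFT defaults:

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2-0.5B",
    train_dataset=dataset,
    args=SFTConfig(output_dir="Qwen2-0.5B-SFT-LoRA"),
    # Only the low-rank adapter weights are trained; the base model stays frozen.
    peft_config=LoraConfig(task_type="CAUSAL_LM"),
)
trainer.train()
```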

## Reinforce Leave-One-Out

@@ -555,6 +579,11 @@ training_args = CPOConfig(
...
)
```
### Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation
**📜 Paper**: https://huggingface.co/papers/2401.08417

Trains models to **avoid adequate but sub-optimal translations** using contrastive preference pairs; pushes 7B–13B MT models to state-of-the-art translation performance.
**Used in TRL via:** [`CPOTrainer`]
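
A minimal sketch with [`CPOTrainer`] on a generic preference dataset; the paper itself trains MT models on translation preference pairs, so the model and dataset names below are placeholders:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = CPOConfig(output_dir="Qwen2-0.5B-CPO")
trainer = CPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```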

## Reward Modeling

@@ -605,3 +634,33 @@ def add_margin(example):

dataset = dataset.map(add_margin)
```
### Solving Math Word Problems with Process- and Outcome-Based Feedback
**📜 Paper**: https://huggingface.co/papers/2211.14275

Shows benefits of **process supervision** (step-level) alongside outcome labels for math reasoning, motivating richer feedback signals for alignment.
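
An illustrative, non-TRL-specific sketch of the two feedback granularities the paper compares: a single outcome label for the final answer versus one label per intermediate reasoning step (hypothetical records, for illustration only):

```python
# Hypothetical data records, for illustration only.
outcome_supervised = {
    "prompt": "Natalia sold clips to 48 friends in April and half as many in May. How many in total?",
    "completion": "April: 48. May: 48 / 2 = 24. Total: 48 + 24 = 72. The answer is 72.",
    "label": True,  # one label: is the final answer correct?
}

process_supervised = {
    "prompt": "Natalia sold clips to 48 friends in April and half as many in May. How many in total?",
    "completions": ["April: 48.", "May: 48 / 2 = 24.", "Total: 48 + 24 = 72.", "The answer is 72."],
    "labels": [True, True, True, True],  # one label per reasoning step
}
```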


## Distillation / Post-training (Background)

### On-Policy Distillation of Language Models
**📜 Paper**: https://huggingface.co/papers/2306.13649

Introduces **GKD**, which trains the student **on-policy** on its own generated sequences with teacher feedback, reducing train-inference distribution mismatch and integrating cleanly with RLHF pipelines.
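
A minimal sketch, assuming TRL's `GKDTrainer`/`GKDConfig`; the model names, dataset, and hyperparameter values below are placeholders:

```python
from datasets import load_dataset
from trl import GKDConfig, GKDTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # any conversational dataset

training_args = GKDConfig(
    output_dir="gkd-student",
    lmbda=0.5,  # fraction of on-policy (student-generated) sequences
    beta=0.5,   # interpolation coefficient of the generalized Jensen-Shannon divergence
)
trainer = GKDTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",          # student
    teacher_model="Qwen/Qwen2-1.5B-Instruct",  # teacher
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```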

## Foundations & Systems (Background)

### Proximal Policy Optimization Algorithms
**📜 Paper**: https://huggingface.co/papers/1707.06347

Foundational **PPO** objective with clipped probability ratios and multiple minibatch epochs per rollout batch; the baseline for many RL/RLHF variants used in TRL.
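
For reference, the clipped surrogate objective at the core of PPO, with the probability ratio taken against the pre-update policy and an advantage estimate:

$$
L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \text{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\right)\right],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
$$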

### ZeRO: Memory Optimizations Toward Training Trillion-Parameter Models
**📜 Paper**: https://huggingface.co/papers/1910.02054

**ZeRO** partitions optimizer states, gradients, and parameters across data-parallel workers to scale training efficiently; relevant when configuring DeepSpeed/Accelerate with TRL.
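
As a hedged illustration, a minimal DeepSpeed configuration fragment enabling ZeRO stage 2 (optimizer-state and gradient partitioning); pass it through your Accelerate/DeepSpeed setup as appropriate, and treat the exact values as placeholders:

```python
# Illustrative DeepSpeed config; "auto" values defer to the Trainer/Accelerate settings.
ds_config = {
    "zero_optimization": {"stage": 2},        # stage 3 additionally partitions parameters
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
```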


[`GRPOTrainer`]: grpo_trainer
[`CPOTrainer`]: cpo_trainer