Changes from 8 commits (25 commits in total)

Commits:
- 494657f: add 10 papers (+trainer cross-links) for #4407 (SSusantAchary, Nov 3, 2025)
- 70a26e6: Add links to GRPOTrainer and CPOTrainer in paper index (SSusantAchary, Nov 3, 2025)
- 85054a4: Merge branch 'main' into Paper-Index-with-10-papers (SSusantAchary, Nov 3, 2025)
- 2e01916: Clean up paper_index.md (SSusantAchary, Nov 3, 2025)
- 4e2f10b: Add DeepSeekMath section with GRPO configuration (SSusantAchary, Nov 3, 2025)
- 0cf0649: Enhance paper index with code examples and updates (SSusantAchary, Nov 3, 2025)
- be590d6: Merge branch 'main' into Paper-Index-with-10-papers (SSusantAchary, Nov 3, 2025)
- 5f2e5e9: Add PEFT section with LoRA implementation example (SSusantAchary, Nov 3, 2025)
- 597a397: Remove beta parameter from GRPOConfig (SSusantAchary, Nov 6, 2025)
- f1d98b0: Refactor training_args by removing unused parameters (SSusantAchary, Nov 6, 2025)
- 9348298: Add Direct Policy Optimization section to paper index (SSusantAchary, Nov 6, 2025)
- a9d9467: Change loss_type from 'sigmoid' to 'hinged' (SSusantAchary, Nov 6, 2025)
- 66cc83d: Revise RSO example with model loading and DPO config (SSusantAchary, Nov 6, 2025)
- e35c02d: removed explicit defaults from DPOConfig (loss_type, beta) per review (SSusantAchary, Nov 6, 2025)
- 98d3086: simplify example; rely on defaults and keep only essential PEFT task_… (SSusantAchary, Nov 6, 2025)
- 627db69: Merge branch 'main' into Paper-Index-with-10-papers (SSusantAchary, Nov 6, 2025)
- 33f7f42: Merge branch 'main' into Paper-Index-with-10-papers (SSusantAchary, Nov 7, 2025)
- 1efce43: Merge branch 'main' into Paper-Index-with-10-papers (SSusantAchary, Nov 7, 2025)
- e50ed53: Merge branch 'main' into Paper-Index-with-10-papers (SSusantAchary, Nov 9, 2025)
- 0856286: Merge branch 'main' into Paper-Index-with-10-papers (SSusantAchary, Nov 11, 2025)
- e0e0e39: Merge branch 'main' into Paper-Index-with-10-papers (SSusantAchary, Nov 14, 2025)
- f204d80: Merge branch 'main' into Paper-Index-with-10-papers (SSusantAchary, Nov 15, 2025)
- a7872c3: Merge branch 'main' into Paper-Index-with-10-papers (SSusantAchary, Nov 22, 2025)
- d428d41: Merge branch 'main' into Paper-Index-with-10-papers (SSusantAchary, Nov 23, 2025)
- ab26bcc: Merge branch 'main' into Paper-Index-with-10-papers (SSusantAchary, Nov 24, 2025)
254 changes: 249 additions & 5 deletions docs/source/paper_index.md
@@ -5,8 +5,28 @@

## Group Relative Policy Optimization

Papers relating to the [`GRPOTrainer`]

### DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

**📜 Paper**: https://huggingface.co/papers/2402.03300

Introduces **GRPO** and shows strong math-reasoning gains from math-centric pretraining plus group-relative PPO-style optimization.

**Used in TRL via:** [`GRPOTrainer`]

```python
# Minimal GRPO setup (mirrors style used for other papers on the page).
from trl import GRPOConfig

training_args = GRPOConfig(
    loss_type="grpo",
    beta=0.0,                       # GRPO commonly trains without explicit KL in released configs
    epsilon=2e-4,                   # clip range (use paper/experiment settings if you mirror them)
    epsilon_high=4e-4,              # upper clip (symmetrical if not specified)
    steps_per_generation=4,         # sample multiple completions per prompt
    gradient_accumulation_steps=1,
    num_generations=8,              # completions per prompt (adjust to your compute)
    max_prompt_length=1024,
    max_completion_length=1024,
)
```

> **Review (Member)** on `beta=0.0`: In the original paper they don't use beta=0.0
>
> **Author:** removed

### Group Sequence Policy Optimization

**📜 Paper**: https://huggingface.co/papers/2507.18071
@@ -232,10 +252,6 @@ trainer = PAPOTrainer(
)
```

## Direct Preference Optimization

Papers relating to the [`DPOTrainer`]

### Direct Preference Optimization (DPO): Your Language Model is Secretly a Reward Model

**📜 Paper**: https://huggingface.co/papers/2305.18290
@@ -457,6 +473,74 @@ training_args = DPOConfig(

These parameters only appear in the [published version](https://aclanthology.org/2025.tacl-1.22.pdf)


### Statistical Rejection Sampling Improves Preference Optimization

**📜 Paper**: https://huggingface.co/papers/2309.06657

Proposes **RSO**, selecting stronger preference pairs via statistical rejection sampling to boost offline preference optimization; complements DPO/SLiC.

```python
# Curate DPO pairs with rejection sampling BEFORE training
from datasets import Dataset
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

def rso_accept(ex):  # replace with your statistic (gap / z-score / judge score)
    return ex.get("rso_keep", True)

dpo_pairs = dpo_pairs.filter(rso_accept)

model = AutoModelForCausalLM.from_pretrained("...")
tok = AutoTokenizer.from_pretrained("...")
args = DPOConfig(loss_type="sigmoid", beta=0.1)
trainer = DPOTrainer(model=model, args=args, tokenizer=tok, train_dataset=dpo_pairs)
trainer.train()
```

> **Review (Member)** on `loss_type="sigmoid"`: isn't the loss supposed to be "hinged"?
>
> **Author:** updated

> **Review (Member):** Suggested change, for consistency with the rest of the page:
>
> ```python
> from datasets import load_dataset
> from trl import DPOConfig, DPOTrainer
>
> train_dataset = load_dataset(...)
>
> def rso_accept(example):  # replace with your statistic (gap / z-score / judge score)
>     return example.get("rso_keep", True)
>
> train_dataset = train_dataset.filter(rso_accept)
>
> training_args = DPOConfig(loss_type="sigmoid", beta=0.1)
> trainer = DPOTrainer(
>     ...,
>     args=training_args,
>     train_dataset=train_dataset,
> )
> trainer.train()
> ```
>
> **Author:** minimal

### Nash Learning from Human Feedback

**📜 Paper**: https://huggingface.co/papers/2312.00886

Frames alignment as a **two-player game**, learning Nash policies from human feedback; connects to multi-objective and competitive preference training.

```python
# Train a DPO policy (one side of the game-theoretic setup)
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("...")
tok = AutoTokenizer.from_pretrained("...")
args = DPOConfig(loss_type="sigmoid", beta=0.1)
trainer = DPOTrainer(model=model, args=args, tokenizer=tok, train_dataset=...)
trainer.train()
```
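
TRL also ships a dedicated online trainer for this game-theoretic setting. The sketch below is minimal and hedged: it assumes the `NashMDTrainer`/`NashMDConfig` API with a pairwise judge such as `PairRMJudge`, so check the current TRL docs for exact argument names.

```python
# Hedged sketch: online Nash-MD training against a pairwise judge (verify names against your TRL version)
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import NashMDConfig, NashMDTrainer, PairRMJudge

model = AutoModelForCausalLM.from_pretrained("...")
tok = AutoTokenizer.from_pretrained("...")

training_args = NashMDConfig(output_dir="nash-md-model")
trainer = NashMDTrainer(
    model=model,
    judge=PairRMJudge(),   # any pairwise judge or reward model can stand in here
    args=training_args,
    train_dataset=...,     # prompt-only dataset
    processing_class=tok,
)
trainer.train()
```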


### Direct Language Model Alignment from Online AI Feedback

**📜 Paper**: https://huggingface.co/papers/2402.04792

Uses **online AI feedback (OAIF)** to supply real-time preference signals, improving direct alignment beyond purely offline pairs.

```python
# Sketch: collect online AI feedback -> extend the DPO dataset -> train
from datasets import Dataset, concatenate_datasets
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

policy = AutoModelForCausalLM.from_pretrained("...")
tok = AutoTokenizer.from_pretrained("...")

def ai_feedback_score(prompt, response) -> float:
    # plug a judge / RM / heuristic here (return scalar)
    pass

new_pairs = []
for ex in prompts_ds:
    out = policy.generate(**tok(ex["prompt"], return_tensors="pt").to(policy.device), max_new_tokens=256)
    resp = tok.decode(out[0], skip_special_tokens=True)
    score = ai_feedback_score(ex["prompt"], resp)
    # build a (chosen, rejected) pair using score (e.g., compare vs baseline response)
    new_pairs.append({"prompt": ex["prompt"], "chosen": resp, "rejected": ex["baseline"]})

augmented_pairs = concatenate_datasets([dpo_pairs, Dataset.from_list(new_pairs)])

args = DPOConfig(loss_type="sigmoid", beta=0.1)
trainer = DPOTrainer(model=policy, args=args, tokenizer=tok, train_dataset=augmented_pairs)
trainer.train()
```

> **Review (Member)** on `args = DPOConfig(loss_type="sigmoid", beta=0.1)`: there is no need to pass a value if it matches the default. In other words, remove any occurrence like `loss_type="sigmoid"` or `beta=0.1`.
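
TRL's online DPO trainer automates this loop (generation and AI judging happen inside training). A minimal, hedged sketch, assuming the `OnlineDPOTrainer`/`OnlineDPOConfig` API with a pairwise judge:

```python
# Hedged sketch: online DPO with AI feedback from a pairwise judge (verify names against your TRL version)
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import OnlineDPOConfig, OnlineDPOTrainer, PairRMJudge

model = AutoModelForCausalLM.from_pretrained("...")
tok = AutoTokenizer.from_pretrained("...")

training_args = OnlineDPOConfig(output_dir="online-dpo-model")
trainer = OnlineDPOTrainer(
    model=model,
    judge=PairRMJudge(),   # the online "AI feedback" source; a reward model also works
    args=training_args,
    train_dataset=...,     # prompt-only dataset
    processing_class=tok,
)
trainer.train()
```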

## Supervised Fine-Tuning

Papers relating to the [`SFTTrainer`]
@@ -508,6 +592,50 @@ SFTConfig(
per_device_train_batch_size=8,
gradient_accumulation_steps=32,
)
```

## Parameter-Efficient Fine-Tuning (PEFT)

### LoRA: Low-Rank Adaptation of Large Language Models

**📜 Paper**: https://huggingface.co/papers/2106.09685

Parameter-efficient fine-tuning via **low-rank adapters**, cutting trainable parameters and memory while preserving quality (see PEFT integration in TRL).

```python
# LoRA adapters with SFT (works the same for DPO/GRPO by passing peft_config to those trainers)
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # any causal LM on HF Hub
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

peft_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # common modules for LLaMA/Mistral/Qwen/Gemma; adjust per model if needed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

args = SFTConfig(
    max_seq_length=2048,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=args,
    tokenizer=tok,
    peft_config=peft_cfg,  # <- LoRA enabled
    train_dataset=...,
)
trainer.train()
```

> **Review (Member):** Suggested change (the more minimal, the clearer):
>
> ```python
> from peft import LoraConfig
> from trl import SFTTrainer
>
> trainer = SFTTrainer(
>     ...,
>     peft_config=LoraConfig(),
> )
> ```
>
> **Author:** updated

## Reinforce Leave-One-Out
@@ -534,7 +662,27 @@ training_args = RLOOConfig(

## Contrastive Preference Optimization

Papers relating to the [`CPOTrainer`]

### Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation

**📜 Paper**: https://huggingface.co/papers/2401.08417

Trains models to **avoid adequate but sub-optimal outputs** using contrastive pairs; improves 7B–13B MT models to SOTA.

**Used in TRL via:** [`CPOTrainer`]

```python
from trl import CPOConfig, CPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("...")
tok = AutoTokenizer.from_pretrained("...")
args = CPOConfig(
    loss_type="sigmoid",   # TRL's default CPO loss
    simpo_gamma=0.1,       # only used when loss_type="simpo"
    beta=0.05,             # KL-like regularization strength
    per_device_train_batch_size=8,
)
trainer = CPOTrainer(model=model, args=args, tokenizer=tok, train_dataset=contrastive_pairs)
trainer.train()
```

### AlphaPO -- Reward shape matters for LLM alignment

@@ -605,3 +753,99 @@ def add_margin(example):

dataset = dataset.map(add_margin)
```

### Solving Math Word Problems with Process- and Outcome-Based Feedback

**📜 Paper**: https://huggingface.co/papers/2211.14275

Shows benefits of **process supervision** (step-level) alongside outcome labels for math reasoning, motivating richer feedback signals for alignment.

```python
# Combine step-level (process) and final-answer (outcome) rewards.
# GRPO reward functions receive the generated completions (plus any extra dataset
# columns, here assumed to be "answer", as keyword arguments) and return one score per completion.
from trl import GRPOConfig, GRPOTrainer

def process_reward(completion) -> float:
    # e.g., fraction of verified-correct reasoning steps (plug in your own step verifier)
    num_correct, num_steps = verify_steps(completion)  # placeholder verifier
    return num_correct / max(1, num_steps)

def outcome_reward(completion, answer) -> float:
    return 1.0 if extract_final_answer(completion) == answer else 0.0  # placeholder check

def fused_reward(completions, answer, **kwargs) -> list[float]:
    return [
        0.5 * process_reward(c) + 0.5 * outcome_reward(c, a)
        for c, a in zip(completions, answer)
    ]

args = GRPOConfig(loss_type="grpo", beta=0.0, steps_per_generation=4, num_generations=4)
trainer = GRPOTrainer(model=..., reward_funcs=[fused_reward], args=args, train_dataset=...)
trainer.train()
```

## Distillation / Post-training (Background)

### On-Policy Distillation of Language Models

**📜 Paper**: https://huggingface.co/papers/2306.13649

Introduces **GKD**, aligning the student with the teacher **on-policy** (i.e., on student-generated sequences) to stabilize and boost instruction tuning and to integrate cleanly with RLHF pipelines.

```python
# 1) Have the TEACHER label the current prompt set (simplified sketch; GKD proper also
#    trains on student-generated sequences scored by the teacher, see the note below)
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig

teacher = AutoModelForCausalLM.from_pretrained("teacher-model")
tok = AutoTokenizer.from_pretrained("teacher-model")

def teacher_label(prompt):
    out = teacher.generate(**tok(prompt, return_tensors="pt").to(teacher.device), max_new_tokens=256)
    return tok.decode(out[0], skip_special_tokens=True)

distill_ds = prompts_ds.map(lambda ex: {"prompt": ex["prompt"], "response": teacher_label(ex["prompt"])})

# 2) Train STUDENT with SFT on those teacher-labeled pairs
student = AutoModelForCausalLM.from_pretrained("student-model")
args = SFTConfig(max_seq_length=2048, per_device_train_batch_size=4, learning_rate=5e-5)
trainer = SFTTrainer(model=student, args=args, tokenizer=tok, train_dataset=distill_ds)
trainer.train()
```
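
For the full on-policy recipe, TRL also provides a dedicated GKD trainer. The sketch below is hedged: it assumes the `GKDTrainer`/`GKDConfig` API with `teacher_model`, `lmbda`, and `beta` arguments named after the paper's formulation, so verify against the current TRL docs.

```python
# Hedged sketch: generalized knowledge distillation with on-policy student samples
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import GKDConfig, GKDTrainer

tok = AutoTokenizer.from_pretrained("teacher-model")
student = AutoModelForCausalLM.from_pretrained("student-model")
teacher = AutoModelForCausalLM.from_pretrained("teacher-model")

training_args = GKDConfig(
    lmbda=0.5,  # fraction of batches built from on-policy (student-generated) sequences
    beta=0.5,   # interpolation coefficient of the generalized Jensen-Shannon divergence
)
trainer = GKDTrainer(
    model=student,
    teacher_model=teacher,
    args=training_args,
    train_dataset=...,   # conversational prompt/completion data
    processing_class=tok,
)
trainer.train()
```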

## Foundations & Systems (Background)

### Proximal Policy Optimization Algorithms

**📜 Paper**: https://huggingface.co/papers/1707.06347

Foundational **PPO** objective with clipped ratios and minibatch epochs; the baseline for many RL/RLHF variants used in TRL.

```python
# PPO-style clipping expressed through the GRPO trainer's config
from trl import GRPOConfig, GRPOTrainer

args = GRPOConfig(
    loss_type="grpo",
    epsilon=0.2,
    epsilon_high=0.2,        # classic PPO-style symmetric clip
    beta=0.01,               # KL coefficient if you want an explicit KL term
    steps_per_generation=4,
    num_generations=8,
)
trainer = GRPOTrainer(model=..., reward_funcs=..., args=args, train_dataset=...)
trainer.train()
```
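
For the paper's actual actor-critic recipe, TRL also includes a PPO trainer. A minimal, hedged sketch, assuming the current `PPOTrainer`/`PPOConfig` API with separate reward and value models (argument names may differ across TRL versions):

```python
# Hedged sketch: classic PPO with explicit reward and value models (verify names against your TRL version)
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
from trl import PPOConfig, PPOTrainer

tok = AutoTokenizer.from_pretrained("...")
policy = AutoModelForCausalLM.from_pretrained("...")
ref_policy = AutoModelForCausalLM.from_pretrained("...")
reward_model = AutoModelForSequenceClassification.from_pretrained("...", num_labels=1)
value_model = AutoModelForSequenceClassification.from_pretrained("...", num_labels=1)

training_args = PPOConfig(output_dir="ppo-model")
trainer = PPOTrainer(
    args=training_args,
    processing_class=tok,
    model=policy,
    ref_model=ref_policy,
    reward_model=reward_model,
    value_model=value_model,
    train_dataset=...,   # prompt-only dataset
)
trainer.train()
```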

### ZeRO: Memory Optimizations Toward Training Trillion-Parameter Models

**📜 Paper**: https://huggingface.co/papers/1910.02054

**ZeRO** partitions optimizer states, gradients, and parameters to scale training efficiently; relevant when configuring DeepSpeed/Accelerate with TRL.

```python
# Most TRL configs forward to HF TrainingArguments under the hood.
# Supply a DeepSpeed ZeRO config via the "deepspeed" argument/path.
from trl import DPOConfig, DPOTrainer

args = DPOConfig(
    loss_type="sigmoid",
    beta=0.1,
    deepspeed="ds_zero_stage2.json",  # path to your ZeRO config
    bf16=True,
)

# ds_zero_stage2.json (very small sketch)
# {
#   "zero_optimization": { "stage": 2 },
#   "train_micro_batch_size_per_gpu": 4,
#   "gradient_accumulation_steps": 8
# }

trainer = DPOTrainer(model=..., args=args, tokenizer=..., train_dataset=...)
trainer.train()
```