Align a Small LLM with GRPO for Strict JSON Generation

Overview

This project applies Group Relative Policy Optimization (GRPO) to fine-tune five small LLMs (135M–2B parameters) so that they generate syntactically valid, schema-conformant JSON. Instead of a neural reward model, up to seven rule-based reward components score each completion, providing a dense, deterministic additive signal.
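For orientation, GRPO samples several completions for the same prompt and normalizes each reward against its own group, so no learned value model is needed. The sketch below illustrates that normalization only. It is not this repository's trainer code: the function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration, and implementations differ on those details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, libraries vary here
    # eps keeps the division finite when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions sampled for one prompt, each scored by a rule-based reward.
group = [1.0, 0.5, 0.5, 0.0]
advantages = group_relative_advantages(group)
```

Completions that score above their group's mean receive a positive advantage and those below receive a negative one. A group where every completion scores the same yields zero advantage for all of them.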

We evaluated four training strategies, each run for 2 500 training steps with 4-bit NF4 quantization and LoRA (r=16) on a single NVIDIA L40S GPU:

No-Think Standard: Direct JSON generation on a mixed-difficulty dataset.
No-Think Curriculum: Direct generation with progressive difficulty scaling (3 stages).
Think Standard: Intermediate <think> reasoning steps required on a mixed dataset.
Think Curriculum: Intermediate <think> steps with progressive difficulty scaling.
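As a rough illustration of what progressive difficulty scaling means in practice, the sketch below maps a training step to one of three stages. The equal-length split and the function name are assumptions. The actual stage boundaries live in the training configs and may differ.

```python
def curriculum_stage(step, total_steps=2500, num_stages=3):
    """Return the 1-based curriculum stage for a step, assuming equal-length stages."""
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    # Integer arithmetic avoids floating-point drift at stage boundaries.
    return step * num_stages // total_steps + 1


first = curriculum_stage(0)      # easiest stage
last = curriculum_stage(2499)    # hardest stage
```

A data loader could use the returned stage to decide which difficulty levels are eligible for sampling at that point in training.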

📖 For comprehensive theoretical details, heatmaps, error evolution, and cross-modality analyses, read the full REPORT.md.

Models

Model	Parameters	Architecture
SmolLM2-135M-Instruct	135M	LLaMA-like
SmolLM2-360M-Instruct	360M	LLaMA-like
Qwen2.5-0.5B-Instruct	0.5B	Qwen2.5
TinyLlama-1.1B-Chat-v1.0	1.1B	LLaMA 2
Gemma-2-2B-it	2B	Gemma 2

Key Results (Peak Performance)

After 2 500 GRPO training steps, all five models reached 87–99% Pass@1. The gains were largest for the smaller, capacity-constrained models, where reasoning tokens and curriculum learning mattered most. The table lists each model's best configuration:

Model	Baseline	Peak Post-GRPO	Best Configuration	Max Absolute Gain
SmolLM2-135M	~31%	90.00%	Think / Standard	+58.67 pp
SmolLM2-360M	~78%	96.67%	No-Think / Standard	+19.00 pp
Qwen2.5-0.5B	~92%	98.00%	Think / Curriculum	+6.33 pp
TinyLlama-1.1B	~79%	99.33%	No-Think / Curriculum	+20.00 pp
Gemma-2-2B	~96%	97.67%	Think / Curriculum	+1.33 pp

Note: Baselines varied slightly between Think/No-Think system prompts. See the full report for detailed stage-by-stage breakdowns.
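For readers unfamiliar with the metric, Pass@1 is the share of prompts whose single sampled completion is accepted. The sketch below uses a deliberately simplified acceptance check (the text parses as a JSON object and contains the required keys). The repository's evaluator validates the full schema, so `passes`, `pass_at_1`, and the sample data are hypothetical stand-ins.

```python
import json


def passes(completion, required_keys):
    """Accept a completion if it parses as a JSON object with every required key."""
    try:
        obj = json.loads(completion)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and required_keys.issubset(obj.keys())


def pass_at_1(completions, required_keys):
    """Percentage of prompts whose single sampled completion passes."""
    hits = sum(passes(c, required_keys) for c in completions)
    return 100.0 * hits / len(completions)


# One completion per prompt; two of the four satisfy the simplified check.
samples = ['{"id": 1, "name": "a"}', '{"id": 2}', "not json", '{"id": 3, "name": "c"}']
score = pass_at_1(samples, {"id", "name"})
```

The gain column in the table is the difference between two such percentages, expressed in percentage points (pp).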

Repository Structure

├── 📁 cluster/                     # Slurm scripts and cluster management
├── 📁 data/                        # Synthetic dataset generation outputs
├── 📁 docs/                        # Documentation and Papers
│   ├── 📝 REPORT.md                # <--- FULL RESULTS AND ANALYSIS HERE
│   └── ...
├── 📁 experiments/
│   ├── 📁 configs/                 # YAML configurations for training & eval
│   └── 📁 logs/grpo/
│       ├── 📁 nothink/
│       │   ├── 📁 standard/        # Evaluation artifacts & figures (No-Think)
│       │   └── 📁 curriculum/
│       └── 📁 think/
│           ├── 📁 standard/        # Evaluation artifacts & figures (Think)
│           └── 📁 curriculum/
├── 📁 notebooks/                   # Jupyter notebooks for fast prototyping
├── 📁 src/
│   ├── 📁 datasets/                # Synthetic data generation and loaders
│   ├── 📁 evaluation/              # Pass@1, schema validation, figure generation
│   ├── 📁 models/                  # LoRA and quantization utilities
│   ├── 📁 training/                # GRPO Trainer and Curriculum logic
│   │   └── 🐍 rewards.py           # 7-component rule-based reward system
│   └── 📁 utils/
└── 📄 pyproject.toml               # uv dependencies

Reward Function

The framework is purely rule-based, which avoids the overhead of a neural reward model. The total reward is an additive combination of up to seven components. When reasoning is disabled (thinking: false), the Reasoning component's weight is automatically redistributed across the remaining components so that their relative ratios are preserved.
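One way to picture the redistribution is to drop the Reasoning weight and rescale the rest to the original total. The sketch below does exactly that. The weight values, dictionary keys, and function names are hypothetical and chosen only for illustration. The real values are defined in rewards.py and the YAML configs.

```python
def active_weights(weights, thinking):
    """Drop the reasoning weight when thinking is off and rescale the rest."""
    if thinking:
        return dict(weights)
    kept = {k: w for k, w in weights.items() if k != "reasoning"}
    # One common scale factor keeps every pairwise ratio unchanged.
    scale = sum(weights.values()) / sum(kept.values())
    return {k: w * scale for k, w in kept.items()}


def total_reward(scores, weights):
    """Additive combination: weighted sum of per-component scores."""
    return sum(weights[k] * scores[k] for k in weights)


# Hypothetical weights, not the repository's actual values.
WEIGHTS = {"format": 0.1, "validity": 0.2, "schema": 0.3, "reasoning": 0.2,
           "truncation": 0.1, "repetition": 0.05, "strictness": 0.05}
no_think = active_weights(WEIGHTS, thinking=False)
```

With these illustrative numbers, the six remaining weights are each multiplied by 1.25, so Schema still weighs three times as much as Format and the total stays the same.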

Component	Purpose
Format	Checks for a proper ```json ... ``` markdown code fence.
Validity	Graduated score based on JSON parseability (partial credit for late-string errors).
Schema	Structural conformance to exact constraints (keys, types, counts, nesting depth).
Reasoning	Evaluates `<think>…</think>` blocks for minimum character count and originality.
Truncation	Penalizes generations interrupted mid-token (e.g., unclosed braces/brackets).
Repetition	Penalizes degenerate loops (token looping, repeated lines, duplicate code blocks).
Strictness	Penalizes "chatty" text outside the requested JSON or Think blocks.
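To make the first two rows concrete, here is a minimal sketch of a Format check and a graduated Validity score. It assumes partial credit is proportional to how far the parser got before the first error, capped at half of a full pass. Both the 0.5 cap and the helper names are assumptions, not the logic in rewards.py.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way so the sketch nests in Markdown
JSON_BLOCK = re.compile(FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(completion):
    """Return the body of the first json code fence, or None if there is none."""
    match = JSON_BLOCK.search(completion)
    return match.group(1).strip() if match else None


def format_score(completion):
    """1.0 when a json code fence is present, else 0.0."""
    return 1.0 if extract_json(completion) is not None else 0.0


def validity_score(payload, partial_cap=0.5):
    """1.0 if parseable; otherwise partial credit that grows with the error position."""
    if not payload:
        return 0.0
    try:
        json.loads(payload)
        return 1.0
    except json.JSONDecodeError as err:
        # err.pos is the character index where parsing failed.
        return partial_cap * err.pos / len(payload)


good = "Here you go:\n" + FENCE + 'json\n{"a": 1}\n' + FENCE
late_error = '{"a": 1, "b": }'
early_error = '{a: 1, "b": 2, "c": 3}'
```

Under this scheme a payload that breaks near the end earns more than one that breaks at the first key, while neither reaches the score of a fully parseable object.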

Setup

Prerequisites: Python 3.10–3.12 and uv.

git clone https://github.com/GiuseppeBellamacina/grpo-strict-generation.git
cd grpo-strict-generation

pip install uv          # if not already installed
uv sync                 # core dependencies
uv sync --extra dev     # + ruff, pytest, black

Cluster setup: see QUICK_SETUP.md for step-by-step instructions or CLUSTER.md for the full Slurm guide.

Usage

1. Generate the Synthetic Dataset (optional)

The training and evaluation pipelines generate the dataset automatically from the YAML config. Run this step only if you want to inspect the dataset on its own:

uv run python -m src.datasets.synthetic_dataset \
    --output data/synthetic \
    --num_samples 5000 \
    --test_ratio 0.2

2. Baseline Evaluation

Evaluate off-the-shelf models without any fine-tuning:

uv run python -m src.evaluation --config experiments/configs/baseline.yaml

3. GRPO Training

Each model has its own config. To enable reasoning or curriculum learning, adjust the thinking and curriculum flags in the respective YAML files.

# Single model training
uv run python -m src.training --config experiments/configs/grpo_smollm2_135m.yaml

# Resume from checkpoint
uv run python -m src.training --config experiments/configs/grpo_smollm2_135m.yaml --resume

On the cluster, use the multi-model chain pipeline:

run-all                    # train + eval all models sequentially
run-all --models=1,2,3     # specific models only
monitor                    # live dashboard (compact)

4. Post-Training Evaluation

# Evaluate final checkpoint vs baseline
uv run python -m src.evaluation --config experiments/configs/grpo_smollm2_135m.yaml --compare

# Evaluate all curriculum stages + baseline (generates the full analysis suite)
uv run python -m src.evaluation --config experiments/configs/grpo_smollm2_135m.yaml --curriculum

License

MIT
