This project applies Group Relative Policy Optimization (GRPO) to fine-tune five small LLMs (135M–2B parameters) so that they generate syntactically valid, schema-conformant JSON. Instead of a neural reward model, up to seven rule-based reward components score each completion, providing a dense, deterministic additive signal.
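The "group relative" part of GRPO can be illustrated with a small sketch: several completions are sampled for the same prompt, each is scored, and each reward is normalized against the others in its group. The function name, the `eps` constant, and the choice of population standard deviation are assumptions made for illustration; this is not the project's trainer code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero when every completion receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rule-based rewards for four completions of one prompt.
rewards = [1.0, 0.25, 0.25, 0.0]
advantages = group_relative_advantages(rewards)
```

Completions scoring above the group mean get a positive advantage and those below get a negative one, so no separate value network or learned reward model is needed to produce a training signal.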
We extensively evaluated four distinct training strategies across 2 500 training steps (using 4-bit NF4 quantization and LoRA r=16 on a single NVIDIA L40S GPU):
- No-Think Standard: Direct JSON generation on a mixed-difficulty dataset.
- No-Think Curriculum: Direct generation with progressive difficulty scaling (3 stages).
- Think Standard: Intermediate `<think>` reasoning steps required on a mixed dataset.
- Think Curriculum: Intermediate `<think>` steps with progressive difficulty scaling.
📖 For comprehensive theoretical details, heatmaps, error evolution, and cross-modality analyses, read the full REPORT.md.
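As a rough illustration of the curriculum strategies listed above, the sketch below maps a training step to one of three difficulty stages. The equal-length stage split and the function name `curriculum_stage` are assumptions for illustration only; the actual stage boundaries are defined by the project's YAML configs and are not reproduced here.

```python
def curriculum_stage(step, total_steps=2500, num_stages=3):
    """Return the 0-based difficulty stage for a training step.

    Assumption: stages have (nearly) equal length. The project's real
    schedule may split the steps differently.
    """
    if not 0 <= step < total_steps:
        raise ValueError(f"step must be in [0, {total_steps}), got {step}")
    return min(step * num_stages // total_steps, num_stages - 1)


# Stage at the first step, the middle of training, and the last step.
stages = [curriculum_stage(s) for s in (0, 1250, 2499)]
```

Under this assumption the easiest examples are seen first and the hardest last, whereas the Standard strategies draw from the mixed-difficulty dataset throughout.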
| Model | Parameters | Architecture |
|---|---|---|
| SmolLM2-135M-Instruct | 135M | LLaMA-like |
| SmolLM2-360M-Instruct | 360M | LLaMA-like |
| Qwen2.5-0.5B-Instruct | 0.5B | Qwen2.5 |
| TinyLlama-1.1B-Chat-v1.0 | 1.1B | LLaMA 2 |
| Gemma-2-2B-it | 2B | Gemma 2 |
After 2 500 GRPO training steps, all five models converged to a tight 87–99% Pass@1 band. Reasoning tokens and curriculum learning helped the smaller, capacity-constrained models the most:
| Model | Baseline | Peak Post-GRPO | Best Configuration | Max Absolute Gain |
|---|---|---|---|---|
| SmolLM2-135M | ~31% | 90.00% | Think / Standard | +58.67 pp |
| SmolLM2-360M | ~78% | 96.67% | No-Think / Standard | +19.00 pp |
| Qwen2.5-0.5B | ~92% | 98.00% | Think / Curriculum | +6.33 pp |
| TinyLlama-1.1B | ~79% | 99.33% | No-Think / Curriculum | +20.00 pp |
| Gemma-2-2B | ~96% | 97.67% | Think / Curriculum | +1.33 pp |
Note: Baselines varied slightly between Think/No-Think system prompts. See the full report for detailed stage-by-stage breakdowns.
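For readers unfamiliar with the metrics in the table, here is a minimal sketch of how Pass@1 and an absolute gain in percentage points (pp) can be computed. It assumes one sampled completion per evaluation prompt and a boolean pass/fail outcome per completion; the helper names and the toy outcome lists are hypothetical and do not come from the project's evaluation code or data.

```python
def pass_at_1(outcomes):
    """Percentage of prompts whose single sampled completion passed."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return 100.0 * sum(outcomes) / len(outcomes)


def gain_pp(baseline_pct, post_pct):
    """Absolute gain in percentage points (a difference, not a ratio)."""
    return post_pct - baseline_pct


# Toy data: 10 prompts, True means the completion passed all checks.
baseline_outcomes = [True] * 4 + [False] * 6
post_outcomes = [True] * 9 + [False] * 1

baseline = pass_at_1(baseline_outcomes)
post = pass_at_1(post_outcomes)
gain = gain_pp(baseline, post)
```

The gain is a difference of two percentages, which is why a model starting near 96% cannot show a large "Max Absolute Gain" regardless of how well training works.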
├── 📁 cluster/ # Slurm scripts and cluster management
├── 📁 data/ # Synthetic dataset generation outputs
├── 📁 docs/ # Documentation and Papers
│   ├── 📝 REPORT.md # <--- FULL RESULTS AND ANALYSIS HERE
│   └── ...
├── 📁 experiments/
│   ├── 📁 configs/ # YAML configurations for training & eval
│   └── 📁 logs/grpo/
│       ├── 📁 nothink/
│       │   ├── 📁 standard/ # Evaluation artifacts & figures (No-Think)
│       │   └── 📁 curriculum/
│       └── 📁 think/
│           ├── 📁 standard/ # Evaluation artifacts & figures (Think)
│           └── 📁 curriculum/
├── 📁 notebooks/ # Jupyter notebooks for fast prototyping
├── 📁 src/
│   ├── 📁 datasets/ # Synthetic data generation and loaders
│   ├── 📁 evaluation/ # Pass@1, schema validation, figure generation
│   ├── 📁 models/ # LoRA and quantization utilities
│   ├── 📁 training/ # GRPO Trainer and Curriculum logic
│   │   └── 🐍 rewards.py # 7-component rule-based reward system
│   └── 📁 utils/
└── 📄 pyproject.toml # uv dependencies
The framework is purely rule-based, avoiding the overhead of a neural reward model. The total reward is an additive combination of up to seven components. When reasoning is disabled (thinking: false), the Reasoning component's weight is automatically redistributed across the remaining components so that their relative ratios are preserved.
| Component | Purpose |
|---|---|
| Format | Checks for a proper ```json ... ``` markdown code fence. |
| Validity | Graduated score based on JSON parseability (partial credit for late-string errors). |
| Schema | Structural conformance to exact constraints (keys, types, counts, nesting depth). |
| Reasoning | Evaluates <think>…</think> blocks for minimum character count and originality. |
| Truncation | Penalizes generations interrupted mid-token (e.g., unclosed braces/brackets). |
| Repetition | Penalizes degenerate loops (token looping, repeated lines, duplicate code blocks). |
| Strictness | Penalizes "chatty" text outside the requested JSON or Think blocks. |
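The additive combination and the weight redistribution described above can be sketched as follows. The weight values, the dictionary keys, and the helper names are hypothetical; the real weights and scoring logic live in `rewards.py` and are not reproduced here.

```python
# Hypothetical component weights (illustrative only, not the project's values).
WEIGHTS = {
    "format": 0.15,
    "validity": 0.25,
    "schema": 0.30,
    "reasoning": 0.10,
    "truncation": 0.05,
    "repetition": 0.05,
    "strictness": 0.10,
}


def effective_weights(weights, thinking):
    """Drop the reasoning weight when thinking is off and rescale the rest.

    Every remaining weight is multiplied by the same factor, so the ratios
    between components are unchanged and the total weight stays the same.
    """
    if thinking:
        return dict(weights)
    kept = {name: w for name, w in weights.items() if name != "reasoning"}
    scale = sum(weights.values()) / sum(kept.values())
    return {name: w * scale for name, w in kept.items()}


def total_reward(scores, weights, thinking):
    """Additive reward: weighted sum of per-component scores."""
    active = effective_weights(weights, thinking)
    return sum(w * scores.get(name, 0.0) for name, w in active.items())


# Hypothetical per-component scores for one completion.
scores = {name: 1.0 for name in WEIGHTS}
reward_nothink = total_reward(scores, WEIGHTS, thinking=False)
```

Rescaling by a single factor is one simple way to satisfy the "preserve component ratios" requirement. With a perfect score on every component, the total is the same whether or not the reasoning component is active.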
Prerequisites: Python 3.10–3.12 and uv.
git clone https://github.com/GiuseppeBellamacina/grpo-strict-generation.git
cd grpo-strict-generation
pip install uv # if not already installed
uv sync # core dependencies
uv sync --extra dev # + ruff, pytest, black

Cluster setup: see QUICK_SETUP.md for step-by-step instructions or CLUSTER.md for the full Slurm guide.
The training and evaluation pipelines generate the dataset automatically from the YAML config. You only need this to inspect it independently:
uv run python -m src.datasets.synthetic_dataset \
--output data/synthetic \
--num_samples 5000 \
--test_ratio 0.2

Evaluate off-the-shelf models without any fine-tuning:
uv run python -m src.evaluation --config experiments/configs/baseline.yaml

Each model has its own config. To enable reasoning or curriculum learning, adjust the thinking and curriculum flags in the respective YAML files.
# Single model training
uv run python -m src.training --config experiments/configs/grpo_smollm2_135m.yaml
# Resume from checkpoint
uv run python -m src.training --config experiments/configs/grpo_smollm2_135m.yaml --resume

On the cluster, use the multi-model chain pipeline:
run-all # train + eval all models sequentially
run-all --models=1,2,3 # specific models only
monitor # live dashboard (compact)

# Evaluate final checkpoint vs baseline
uv run python -m src.evaluation --config experiments/configs/grpo_smollm2_135m.yaml --compare
# Evaluate all curriculum stages + baseline (generates the full analysis suite)
uv run python -m src.evaluation --config experiments/configs/grpo_smollm2_135m.yaml --curriculum