Try the following commands to install the environment:

```bash
mamba env create -f environment.yml
```

Try the following commands to generate the dataset:

```bash
bash scripts/sampling.sh
bash scripts/pipeline_n4_gemma.sh
```

Try the following command to train the PAD model:

```bash
bash run_ppd.sh
```

You can find the trained model under `outputs/*`.
Please ensure that the file paths in `training_configs/gemma-2-2b-it-pd.yaml` match your configuration.
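If you are unsure which entries to check, a quick grep for path-like keys can help; this is only a rough sketch, and the key names in the actual config may differ:

```bash
# Rough filter for path-like fields in the training config;
# adjust the pattern to the key names actually used in the file.
grep -nE '(path|dir|file)' training_configs/gemma-2-2b-it-pd.yaml
```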
We follow the official implementations for evaluation on AlpacaEval 2, Arena-Hard, MT-Bench, and GSM8K:

- AlpacaEval 2: Please refer to the AlpacaEval repo for evaluation.
- Arena-Hard: Please refer to the Arena-Hard-Auto repo for evaluation.
- MT-Bench: Please refer to the FastChat repo for evaluation.
- GSM8K: Please refer to the ZeroEval repo for evaluation.
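For AlpacaEval 2, for example, scoring a file of model generations looks roughly like the sketch below; the generations path is an assumption about this repo's output layout, and the exact flags and annotator configuration should be taken from the AlpacaEval repo:

```bash
# Minimal AlpacaEval 2 sketch (see the AlpacaEval repo for exact flags and annotator settings).
# The GPT-4-based judge requires an OpenAI API key: export OPENAI_API_KEY=...
# The generations file path below is an assumption, not a path produced by this repo's scripts.
pip install alpaca-eval
alpaca_eval --model_outputs outputs/alpaca_eval_generations.json
```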
This section contains the training logs and a comparative analysis of three preference alignment methods: SimPO, DPO, and PAD. We document the training process, implementation details, and performance metrics for each approach.
- DPO: Based on the implementation from TRL
- SimPO: Based on the implementation from princeton-nlp/SimPO
- Student Model: Gemma-2-2B-It
- Teacher Model: Gemma-2-9B-It
- GPUs: 2 × A800 (80G)
- Training Type: Full parameter fine-tuning
- Memory Optimization: ZeRO Stage 2
- Epochs: 1
- Precision: BFloat16
- Dataset Size:
  - Training samples: 55,321
  - Test samples: 1,130
- Batch Size: 128
- Total Training Steps: 432
- Maximum Sequence Length: 2048
- Per Device Train Batch Size: 2
- Per Device Evaluation Batch Size: 2
- Gradient Accumulation Steps: 32
- Evaluation Frequency: Every 100 training steps
- Gradient Checkpointing: Enabled
For additional parameters, please refer to the paper or the configuration files.
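As a quick consistency check on the hyperparameters listed above, the effective batch size and step count can be recomputed from the per-device batch size, gradient accumulation, GPU count, and dataset size:

```bash
# per-device batch size x gradient accumulation steps x number of GPUs
echo $((2 * 32 * 2))      # 128, matching the reported batch size
# training samples / effective batch size
echo $((55321 / 128))     # 432, matching the reported total training steps
```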
| Method | GPU Hours | Alpaca-Eval 2.0 LC (%) |
|---|---|---|
| DPO | 8.7856 | 43.77 |
| SimPO | 7.2672 | 44.94 |
| PAD | 7.2884 | 45.73 |
You can find the training logs under `gemma-log/*`.
- Training Efficiency: PAD and SimPO require similar computational resources, while DPO demands notably more. This efficiency difference is primarily because DPO requires loading an additional reference model during training, whereas PAD and SimPO do not.
- Performance: PAD outperforms both SimPO and DPO in terms of the AlpacaEval 2 length-controlled (LC) win rate, which aligns with the findings reported in the submission paper.