This repository provides the pipeline accompanying the paper *On the Fragility of Benchmark Contamination Detection in Reasoning Models* by Han Wang\*, Haoyu Li\*, Brian Ko\*, and Huan Zhang (\* equal contribution).
To set up the environment for benchmark contamination detection on LRMs, run:

```bash
conda create --name lrm_conta python==3.10.16
conda activate lrm_conta
pip install --no-build-isolation -r requirements.txt
```

We provide the datasets with member and non-member labels in `./datasets` and release the model checkpoints as listed below. We report AUROC for the LOSS detector (Carlini et al., 2021) to show that RL can conceal contamination introduced as a base model evolves into an LRM.
| Models | Olympiad | GPQA | AIME25 | AIME24 | Minerva | AMC23 | Avg. | Δ Avg. |
|---|---|---|---|---|---|---|---|---|
| SFT conta Qwen-7B w/o further RL | 69.18 | 69.81 | 86.22 | 77.33 | 70.95 | 79.38 | 75.48 | +0.00 |
| SFT conta Qwen-7B w/ GRPO (Step 64) | 55.22 | 60.50 | 62.44 | 65.78 | 61.50 | 62.12 | 61.26 | -14.22 |
| SFT conta Llama-8B w/o further RL | 69.44 | 72.88 | 98.67 | 99.56 | 76.91 | 94.50 | 85.33 | +0.00 |
| SFT conta Llama-8B w/ GRPO (Step 64) | 59.18 | 69.84 | 60.00 | 68.00 | 74.02 | 76.75 | 67.97 | -17.36 |
We also report LOSS results on SFT-contaminated LRMs below.
| Models | Olympiad | GPQA | AIME25 | AIME24 | Minerva | AMC23 | Avg. |
|---|---|---|---|---|---|---|---|
| SFT conta DeepSeek-distill Llama-8B | 57.91 | 61.78 | 52.89 | 73.56 | 67.00 | 62.38 | 62.59 |
| SFT conta DeepSeek-distill Qwen-7B | 49.77 | 54.09 | 52.00 | 63.78 | 54.76 | 56.75 | 55.19 |
| SFT conta OpenThinker-7B (15K) | 53.44 | 57.61 | 61.33 | 56.67 | 55.07 | 48.12 | 55.37 |
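For reference, the LOSS score behind these tables is simply the model's average token-level loss on each example, with AUROC computed over member/non-member labels. Below is a minimal sketch; the checkpoint name and the toy texts/labels are placeholders for illustration, not our evaluation data:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.metrics import roc_auc_score

# Placeholder checkpoint; substitute one of the released contaminated models.
MODEL = "Qwen/Qwen2.5-Math-7B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

@torch.no_grad()
def avg_nll(text: str) -> float:
    """Mean next-token negative log-likelihood of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    # With labels=input_ids, the causal-LM loss is the mean token-level NLL.
    return model(ids, labels=ids).loss.item()

# Toy example: label 1 = member (seen in training), 0 = non-member.
texts = ["A benchmark question seen in training.", "A held-out question."]
labels = [1, 0]
scores = [-avg_nll(t) for t in texts]  # negate: higher score => more member-like
print("LOSS AUROC:", roc_auc_score(labels, scores))
```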
For SFT training, please refer to LLaMA-Factory. For RL training, please refer to verl.
Our inference pipeline supports data parallelism. You can adjust `GPU_LIST` and `GPUS_PER_JOB` in `async_inference.sh` to shard the datasets based on your hardware resources. To run the inference, execute:

```bash
bash async_inference.sh
```

To get the pass@1 metric without running the detection approaches, execute:

```bash
bash check_pass_1.sh
```

The released pipeline supports:
- Perturbation-based detection: Neighbor (Mattern et al., 2023)
- Reference-based detection: LiRA (Mireshghallah et al., 2022), Ref (Carlini et al., 2021)
- Reference-free detection: Zlib (Carlini et al., 2021), Min-K%++ (Zhang et al., 2024), Min-K% (Shi et al., 2023), Max-K% (Maini et al., 2024), Loss (Carlini et al., 2021)
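For concreteness, here is a minimal sketch of the Min-K% score (Shi et al., 2023) from the reference-free family, assuming you already have per-token log-probabilities from the target model:

```python
import torch

def min_k_score(token_log_probs: torch.Tensor, k: float = 0.2) -> float:
    """Min-K% (Shi et al., 2023): average the lowest k-fraction of per-token
    log-probs. Members tend to contain fewer highly surprising tokens, so a
    higher (less negative) score suggests membership."""
    n = max(1, int(token_log_probs.numel() * k))
    lowest = torch.topk(token_log_probs, n, largest=False).values
    return lowest.mean().item()

# Example: log-probs for a 10-token sequence.
lp = torch.tensor([-0.1, -0.5, -2.3, -0.2, -4.1, -0.3, -1.0, -0.4, -0.6, -3.2])
print(min_k_score(lp, k=0.2))
```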
To run the detection scripts, execute:

```bash
bash async_detection.sh
bash check_auroc.sh
```

You can modify the `MIAS` list in `async_detection.sh` to run different detection methods: `loss` saves the results of Zlib, Min-K%, Max-K%, and Loss; `ref` saves the results of LiRA and Ref; `min-k++` saves Min-K%++; and `perturb` saves Neighbor. We are happy to support more contamination detection methods tailored to LRMs.
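As an illustration of the perturbation-based family, here is a hedged sketch of the Neighbor score (Mattern et al., 2023); the `loss_fn` and the neighbor texts are assumed to come from your own pipeline (e.g., masked-LM perturbations of the original text):

```python
from typing import Callable

def neighbor_score(text: str, neighbors: list[str],
                   loss_fn: Callable[[str], float]) -> float:
    """Neighbor attack (Mattern et al., 2023): compare the loss of `text`
    with the mean loss of semantically similar perturbations ("neighbors").
    Members sit in a sharper loss minimum, so the gap is larger for them."""
    neighbor_mean = sum(loss_fn(n) for n in neighbors) / len(neighbors)
    return neighbor_mean - loss_fn(text)  # larger gap => more member-like
```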
If you find our code useful, please consider citing our paper:
```bibtex
@article{wang2025fragility,
  title={On the Fragility of Benchmark Contamination Detection in Reasoning Models},
  author={Wang, Han and Li, Haoyu and Ko, Brian and Zhang, Huan},
  journal={arXiv preprint arXiv:2510.02386},
  year={2025}
}
```
