This repository provides the pipeline accompanying the paper *On the Fragility of Benchmark Contamination Detection in Reasoning Models* by Han Wang\*, Haoyu Li\*, Brian Ko\*, and Huan Zhang (\* equal contribution).
To set up the environment for benchmark contamination detection on LRMs, run:

```bash
conda create --name lrm_conta python==3.10.16
conda activate lrm_conta
pip install --no-build-isolation -r requirements.txt
```

We provide the datasets with member and non-member labels in `./datasets` and release the model checkpoints as listed below. We report AUROC for the LOSS detector (Carlini et al., 2021) to show that RL can conceal contamination introduced as a base model evolves into an LRM.
| Models | Olympiad | GPQA | AIME25 | AIME24 | Minerva | AMC23 | Avg. | Δ Avg. |
|---|---|---|---|---|---|---|---|---|
| SFT conta Qwen-7B w/o further RL | 69.18 | 69.81 | 86.22 | 77.33 | 70.95 | 79.38 | 75.48 | +0.00 |
| SFT conta Qwen-7B w/ GRPO (Step 64) | 55.22 | 60.50 | 62.44 | 65.78 | 61.50 | 62.12 | 61.26 | -14.22 |
| SFT conta Llama-8B w/o further RL | 69.44 | 72.88 | 98.67 | 99.56 | 76.91 | 94.50 | 85.33 | +0.00 |
| SFT conta Llama-8B w/ GRPO (Step 64) | 59.18 | 69.84 | 60.00 | 68.00 | 74.02 | 76.75 | 67.97 | -17.36 |
We also report LOSS results on SFT-contaminated LRMs below.
| Models | Olympiad | GPQA | AIME25 | AIME24 | Minerva | AMC23 | Avg. |
|---|---|---|---|---|---|---|---|
| SFT conta DeepSeek-distill Llama-8B | 57.91 | 61.78 | 52.89 | 73.56 | 67.00 | 62.38 | 62.59 |
| SFT conta DeepSeek-distill Qwen-7B | 49.77 | 54.09 | 52.00 | 63.78 | 54.76 | 56.75 | 55.19 |
| SFT conta OpenThinker-7B (15K) | 53.44 | 57.61 | 61.33 | 56.67 | 55.07 | 48.12 | 55.37 |
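For reference, the LOSS score behind these tables is simply the model's average token-level loss on each example, with AUROC computed over member/non-member labels. Below is a minimal sketch; the checkpoint name and the toy texts/labels are placeholders for illustration, not our evaluation data:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.metrics import roc_auc_score

# Placeholder checkpoint; substitute one of the released contaminated models.
MODEL = "Qwen/Qwen2.5-Math-7B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

@torch.no_grad()
def avg_nll(text: str) -> float:
    """Mean next-token negative log-likelihood of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    # With labels=input_ids, the causal-LM loss is the mean token-level NLL.
    return model(ids, labels=ids).loss.item()

# Toy example: label 1 = member (seen in training), 0 = non-member.
texts = ["A benchmark question seen in training.", "A held-out question."]
labels = [1, 0]
scores = [-avg_nll(t) for t in texts]  # negate: higher score => more member-like
print("LOSS AUROC:", roc_auc_score(labels, scores))
```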
For SFT training, please refer to LLaMA-Factory. For RL training, please refer to verl.
Our inference pipeline supports data parallelism. You can adjust `GPU_LIST` and `GPUS_PER_JOB` in `async_inference.sh` to shard the datasets based on your hardware resources. To run the inference, execute:

```bash
bash async_inference.sh
```

To get the pass@1 metric without running the detection approaches, execute:

```bash
bash check_pass_1.sh
```

The released pipeline supports:
- Perturbation-based detection: Neighbor (Mattern et al., 2023)
- Reference-based detection: LiRA (Mireshghallah et al., 2022), Ref (Carlini et al., 2021)
- Reference-free detection: Zlib (Carlini et al., 2021), Min-K%++ (Zhang et al., 2024), Min-K% (Shi et al., 2023), Max-K% (Maini et al., 2024), Loss (Carlini et al., 2021)
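For concreteness, here is a minimal sketch of the Min-K% score (Shi et al., 2023) from the reference-free family, assuming you already have per-token log-probabilities from the target model:

```python
import torch

def min_k_score(token_log_probs: torch.Tensor, k: float = 0.2) -> float:
    """Min-K% (Shi et al., 2023): average the lowest k-fraction of per-token
    log-probs. Members tend to contain fewer highly surprising tokens, so a
    higher (less negative) score suggests membership."""
    n = max(1, int(token_log_probs.numel() * k))
    lowest = torch.topk(token_log_probs, n, largest=False).values
    return lowest.mean().item()

# Example: log-probs for a 10-token sequence.
lp = torch.tensor([-0.1, -0.5, -2.3, -0.2, -4.1, -0.3, -1.0, -0.4, -0.6, -3.2])
print(min_k_score(lp, k=0.2))
```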
To run the detection scripts, execute:

```bash
bash async_detection.sh
bash check_auroc.sh
```

You can modify the `MIAS` list in `async_detection.sh` to run different detection methods: `loss` saves the results of Zlib, Min-K%, Max-K%, and Loss; `ref` saves the results of LiRA and Ref; `min-k++` saves Min-K%++; and `perturb` saves Neighbor. We are happy to support more contamination detection methods tailored to LRMs.
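As an illustration of the perturbation-based family, here is a hedged sketch of the Neighbor score (Mattern et al., 2023); the `loss_fn` and the neighbor texts are assumed to come from your own pipeline (e.g., masked-LM perturbations of the original text):

```python
from typing import Callable

def neighbor_score(text: str, neighbors: list[str],
                   loss_fn: Callable[[str], float]) -> float:
    """Neighbor attack (Mattern et al., 2023): compare the loss of `text`
    with the mean loss of semantically similar perturbations ("neighbors").
    Members sit in a sharper loss minimum, so the gap is larger for them."""
    neighbor_mean = sum(loss_fn(n) for n in neighbors) / len(neighbors)
    return neighbor_mean - loss_fn(text)  # larger gap => more member-like
```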
If you find our code useful, please consider citing our paper:
```bibtex
@article{wang2025fragility,
  title={On the Fragility of Benchmark Contamination Detection in Reasoning Models},
  author={Wang, Han and Li, Haoyu and Ko, Brian and Zhang, Huan},
  journal={arXiv preprint arXiv:2510.02386},
  year={2025}
}
```
