# LRM Contamination Detection Arena

This repository provides the pipeline accompanying the paper *On the Fragility of Benchmark Contamination Detection in Reasoning Models* by Han Wang\*, Haoyu Li\*, Brian Ko\*, and Huan Zhang (\* equal contribution).

*Teaser figure.*

## Environment

To set up the environment for benchmark contamination detection on LRMs, run the following commands:

```bash
conda create --name lrm_conta python==3.10.16
conda activate lrm_conta
pip install --no-build-isolation -r requirements.txt
```

## Checkpoints and Datasets

We provide the datasets with member and non-member labels in `./datasets` and release the model checkpoints below. We report AUROC for the LOSS detector (Carlini et al., 2021) to show that RL can conceal contamination introduced when the base model evolves into an LRM; a minimal scoring sketch follows the tables. $\Delta$ measures the difference from the SFT-contaminated model w/o RL.

| Models | Olympiad | GPQA | AIME25 | AIME24 | Minerva | AMC23 | Avg. | $\Delta$ |
|---|---|---|---|---|---|---|---|---|
| SFT conta Qwen-7B w/o further RL | 69.18 | 69.81 | 86.22 | 77.33 | 70.95 | 79.38 | 75.48 | +0.00 |
| SFT conta Qwen-7B w/ GRPO (Step 64) | 55.22 | 60.50 | 62.44 | 65.78 | 61.50 | 62.12 | 61.26 | -14.22 |
| SFT conta Llama-8B w/o further RL | 69.44 | 72.88 | 98.67 | 99.56 | 76.91 | 94.50 | 85.33 | +0.00 |
| SFT conta Llama-8B w/ GRPO (Step 64) | 59.18 | 69.84 | 60.00 | 68.00 | 74.02 | 76.75 | 67.97 | -17.36 |

We also list LOSS AUROC results on SFT-contaminated LRMs below.

| Models | Olympiad | GPQA | AIME25 | AIME24 | Minerva | AMC23 | Avg. |
|---|---|---|---|---|---|---|---|
| SFT conta Deepseek distill Llama-8B | 57.91 | 61.78 | 52.89 | 73.56 | 67.00 | 62.38 | 62.59 |
| SFT conta Deepseek distill Qwen-7B | 49.77 | 54.09 | 52.00 | 63.78 | 54.76 | 56.75 | 55.19 |
| SFT conta OpenThinker-7B (15K) | 53.44 | 57.61 | 61.33 | 56.67 | 55.07 | 48.12 | 55.37 |
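
For reference, the LOSS detector scores each example by the model's average token-level negative log-likelihood and uses it directly as a membership score; AUROC is then computed over the member/non-member labels. A minimal sketch, assuming a Hugging Face causal LM and pre-loaded `texts`/`labels` (illustrative names, not the pipeline's API):

```python
import torch
from sklearn.metrics import roc_auc_score

@torch.no_grad()
def loss_score(model, tokenizer, text, device="cuda"):
    """Average token-level negative log-likelihood of `text` under `model`."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    # Hugging Face causal LMs return the mean cross-entropy over tokens in .loss
    return model(ids, labels=ids).loss.item()

# `model`, `tokenizer`, `texts`, and `labels` (1 = member, 0 = non-member)
# are assumed to be loaded elsewhere. Negate the loss so higher = member.
scores = [-loss_score(model, tokenizer, t) for t in texts]
print(f"LOSS detector AUROC: {roc_auc_score(labels, scores):.4f}")
```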

For SFT training, please refer to LLaMA-Factory; for RL training, please refer to verl.

## Inference

Our inference pipeline supports data-parallel execution: adjust GPU_LIST and GPUS_PER_JOB in async_inference.sh to shard the datasets across your hardware (the sketch below illustrates the idea). To run inference, execute the following command:

```bash
bash async_inference.sh
```
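
The data-parallel idea is to split the evaluation set into one shard per job and give each job its own GPU group. A minimal sketch of that partitioning; the variable names mirror the script's GPU_LIST/GPUS_PER_JOB, but the code is illustrative, not the script itself:

```python
# Illustrative only: how GPU_LIST and GPUS_PER_JOB translate into per-job shards.
GPU_LIST = [0, 1, 2, 3]   # physical GPU ids available on the machine
GPUS_PER_JOB = 2          # GPUs consumed by each inference job

n_jobs = len(GPU_LIST) // GPUS_PER_JOB
gpu_groups = [GPU_LIST[i * GPUS_PER_JOB:(i + 1) * GPUS_PER_JOB]
              for i in range(n_jobs)]

def shard(dataset, n_shards):
    """Round-robin split of `dataset` into `n_shards` roughly equal pieces."""
    return [dataset[i::n_shards] for i in range(n_shards)]

dataset = list(range(10))  # stand-in for the evaluation set
for gpus, piece in zip(gpu_groups, shard(dataset, n_jobs)):
    print(f"GPUs {gpus} -> examples {piece}")
```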

To get the pass@1 metric without running the detection approaches, execute the following command:

```bash
bash check_pass_1.sh
```
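
For a single sample per problem, pass@1 is simply the fraction of problems answered correctly; with several samples per problem it is the mean per-problem success rate. A minimal sketch (not the check_pass_1.sh implementation):

```python
def pass_at_1(results):
    """results[i][j] is True if sample j for problem i is correct.

    pass@1 = average over problems of each problem's empirical success rate.
    """
    per_problem = [sum(r) / len(r) for r in results]
    return sum(per_problem) / len(per_problem)

# Example: 3 problems, 4 samples each -> (0.75 + 0.0 + 1.0) / 3 ≈ 0.583
print(pass_at_1([[True, True, False, True],
                 [False, False, False, False],
                 [True, True, True, True]]))
```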

## Detection

The released pipeline supports the following detection methods: Loss, Zlib, Min-K%, Max-K%, Min-K%++, Ref, LiRA, and Neighbor.

To run the detection scripts, execute the following commands:

```bash
bash async_detection.sh
bash check_auroc.sh
```

You can modify the MIAS list in async_detection.sh to run different detection methods: `loss` saves results for Zlib, Min-K%, Max-K%, and Loss; `ref` saves results for LiRA and Ref; `min-k++` saves Min-K%++; and `perturb` saves Neighbor. A Min-K% scoring sketch is given below, and we are happy to support more contamination detection methods tailored to LRMs.
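
As an illustration of one of these scores, Min-K% averages the k% lowest token log-probabilities of an example, on the intuition that memorized text contains few surprisingly unlikely tokens. A minimal sketch, assuming per-token log-probabilities are already available (illustrative code, not the pipeline's implementation):

```python
import numpy as np

def min_k_percent(token_logprobs, k=20.0):
    """Min-K% score: mean of the lowest k% token log-probabilities.

    Higher (less negative) scores suggest the example was seen in training.
    """
    lp = np.sort(np.asarray(token_logprobs))       # ascending: lowest first
    n = max(1, int(len(lp) * k / 100.0))           # number of tokens to keep
    return lp[:n].mean()

# Example: with k=40, the two lowest of five log-probs are averaged -> -0.25
print(min_k_percent([-0.1, -0.2, -0.05, -0.3, -0.15], k=40))
```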

## Citation

If you find our code useful, please consider citing our paper:

```bibtex
@article{wang2025fragility,
  title={On the Fragility of Benchmark Contamination Detection in Reasoning Models},
  author={Wang, Han and Li, Haoyu and Ko, Brian and Zhang, Huan},
  journal={arXiv preprint arXiv:2510.02386},
  year={2025}
}
```
