First of all, I want to express my sincere gratitude to the Open-Source-O1 team for their tremendous work and effort in creating and sharing the OpenO1-SFT dataset with the community. This contribution is valuable for advancing open-source AI research.
Issue Description
During my experiments with the OpenO1-SFT dataset, I discovered a potential data leakage issue between the training data and the MATH-500 benchmark test set. The issue came to light when I observed unusually large performance improvements on the MATH-500 benchmark after fine-tuning models on the OpenO1-SFT dataset.
Investigation Results
Using the MinHash LSH algorithm for similarity analysis, I found exact duplicates between the OpenO1-SFT training set and the MATH-500 test set. For example:
The following problem appears in both datasets with 100% similarity:
Find the number of integer values of $k$ in the closed interval $[-500,500]$ for which the equation $\log(kx)=2\log(x+2)$ has exactly one real solution.
- Location in OpenO1-SFT: index 22850
- Location in MATH-500: index 80
Impact
This data leakage raises several concerns:
- Models fine-tuned on OpenO1-SFT may show artificially inflated performance on MATH-500
- The validity of MATH-500 as a benchmark for these models is compromised
- Research conclusions based on these evaluations might be misleading
- Since MATH-500 is a subset of the MATH dataset, other MATH-related benchmarks might also be affected
It's worth noting that while this issue has been confirmed for MATH-500, the status of other widely used benchmarks (such as GSM8K) remains unclear and requires further investigation. However, my analysis shows no evidence of data leakage in newer and more challenging benchmarks like GPQA-Diamond and AIME-2024.
How to Reproduce
You can verify this issue using the following code:
from datasets import load_dataset
from datasketch import MinHash, MinHashLSH
import json

trainset = load_dataset("llamafactory/OpenO1-SFT")["train"]
benchmark = load_dataset("qq8933/MATH500")["test"]

def create_minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    # Simple text preprocessing
    text = text.lower()
    # Simple tokenization (split by space)
    tokens = text.split()
    # Character-level 3-grams
    character_grams = [text[i:i+3] for i in range(len(text)-2)]
    # Word-level 2-grams (if there are consecutive words)
    word_grams = [' '.join(tokens[i:i+2]) for i in range(len(tokens)-1)]
    # Add all features to MinHash
    for d in character_grams + word_grams:
        m.update(d.encode('utf8'))
    return m

# Create LSH index
lsh = MinHashLSH(threshold=0.5, num_perm=128)

# Add training set to LSH index
print("Building training set index...")
train_minhashes = {}
for idx, item in enumerate(trainset):
    minhash = create_minhash(item['prompt'])
    train_minhashes[idx] = minhash
    lsh.insert(f"train_{idx}", minhash)

# Find similar questions
print("\nFinding similar questions...")
similar_pairs = []
for bench_idx, item in enumerate(benchmark):
    bench_minhash = create_minhash(item['problem'])
    # Query similar training set questions
    similar_trains = lsh.query(bench_minhash)
    for train_id in similar_trains:
        train_idx = int(train_id.split('_')[1])
        # Estimate the Jaccard similarity from the two MinHash signatures
        similarity = bench_minhash.jaccard(train_minhashes[train_idx])
        similar_pairs.append({
            'benchmark_idx': bench_idx,
            'benchmark_text': item['problem'],
            'train_idx': train_idx,
            'train_text': trainset[train_idx]['prompt'],
            'similarity': similarity
        })

# Sort by similarity
similar_pairs.sort(key=lambda x: x['similarity'], reverse=True)

# Output results
print(f"\nFound {len(similar_pairs)} similar question pairs")
print("\nSimilarity distribution:")
similarities = [pair['similarity'] for pair in similar_pairs]
if similarities:
    print(f"Maximum similarity: {max(similarities):.3f}")
    print(f"Minimum similarity: {min(similarities):.3f}")
    print(f"Average similarity: {sum(similarities)/len(similarities):.3f}")

print("\nTop 5 examples with highest similarity:")
for i, pair in enumerate(similar_pairs[:5]):
    print(f"\nSimilar pair #{i+1} (similarity: {pair['similarity']:.3f}):")
    print(f"Benchmark #{pair['benchmark_idx']}:")
    print(pair['benchmark_text'][:200] + "...")
    print(f"\nTrainset #{pair['train_idx']}:")
    print(pair['train_text'][:200] + "...")

# Save pairs above the 0.5 threshold, one JSON object per line
with open('similar_pairs_more_than_0.5.jsonl', 'w') as f:
    for pair in similar_pairs:
        if pair['similarity'] > 0.5:
            f.write(json.dumps(pair) + '\n')
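The similarity printed by the script is a MinHash estimate, so a pair reported at 1.000 is very likely, but not strictly guaranteed, to have identical feature sets. As a minimal sketch for double-checking flagged pairs, the exact Jaccard similarity can be computed over the same features (character 3-grams plus word 2-grams). The helper names `features` and `exact_jaccard` are hypothetical and not part of the original script:

```python
def features(text):
    """Character 3-grams plus word 2-grams, mirroring the script's feature set."""
    text = text.lower()
    tokens = text.split()
    char_grams = {text[i:i + 3] for i in range(len(text) - 2)}
    word_grams = {" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)}
    return char_grams | word_grams


def exact_jaccard(a, b):
    """Exact Jaccard similarity: size of intersection over size of union."""
    fa, fb = features(a), features(b)
    union = fa | fb
    if not union:
        # Both texts are too short to produce any feature; nothing to compare.
        return 0.0
    return len(fa & fb) / len(union)
```

Calling `exact_jaccard(pair['benchmark_text'], pair['train_text'])` on each saved pair would separate true duplicates from near-duplicates that the estimate happened to round up.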
Recommendations
When using SFT datasets for model training, it's crucial to carefully select and verify the benchmarks used for evaluation. Data leakage between training data and test sets can lead to misleading performance metrics that don't accurately reflect the model's true capabilities. This case serves as a reminder of the importance of benchmark integrity in machine learning research.
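One lightweight way to verify a benchmark before relying on it is an exact-match check on normalized text. The sketch below is an illustration only, not the method used in this report: `normalize` and `find_exact_overlap` are hypothetical helpers, and lowercasing plus whitespace collapsing is an assumed normalization that will miss paraphrased or reformatted problems.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", text.strip().lower())


def find_exact_overlap(train_prompts, benchmark_problems):
    """Return (benchmark_idx, train_idx) pairs whose normalized text is identical."""
    # Map each normalized training prompt to every index where it occurs.
    index = {}
    for train_idx, prompt in enumerate(train_prompts):
        index.setdefault(normalize(prompt), []).append(train_idx)

    pairs = []
    for bench_idx, problem in enumerate(benchmark_problems):
        for train_idx in index.get(normalize(problem), []):
            pairs.append((bench_idx, train_idx))
    return pairs
```

An empty result only rules out verbatim copies; a fuzzy method such as MinHash LSH is still needed to catch near-duplicates.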
For researchers working with the OpenO1-SFT dataset, I recommend using newer and more challenging benchmarks such as GPQA-Diamond and AIME-2024, where my analysis has shown no evidence of data leakage.
Once again, I want to express my gratitude to the OpenO1 team for their significant contribution to the open-source AI community. This issue doesn't diminish the value of their work, and I hope this report helps in further improving the dataset's quality.