Data Leakage Issue in OpenO1-SFT Dataset #10

First of all, I want to express my sincere gratitude to the Open-Source-O1 team for their tremendous work and effort in creating and sharing the OpenO1-SFT dataset with the community. This contribution is valuable for advancing open-source AI research.

## Issue Description

During my experiments with the OpenO1-SFT dataset, I discovered a potential data leakage issue between the training data and the MATH-500 benchmark test set. I noticed it after observing unusually large performance improvements on MATH-500 when fine-tuning models on OpenO1-SFT.

## Investigation Results

Using the MinHash LSH algorithm for similarity analysis, I found exact duplicates between the OpenO1-SFT training set and the MATH-500 test set. For example, the following problem appears in both datasets with 100% similarity:
```
Find the number of integer values of $k$ in the closed interval $[-500,500]$ for which the equation $\log(kx)=2\log(x+2)$ has exactly one real solution.
```
- Location in OpenO1-SFT: index 22850
- Location in MATH-500: index 80
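
Because MinHash only estimates similarity, a pair reported at 100% can be confirmed with a direct string comparison. The following is a minimal sketch, assuming that equality after lowercasing and whitespace normalization is an adequate definition of "exact duplicate". The helper names `normalize` and `find_exact_duplicates` are hypothetical and are not part of either dataset or of any library, and the toy strings are for illustration only.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase the text and collapse whitespace runs to single spaces."""
    return " ".join(text.lower().split())


def find_exact_duplicates(train_texts, bench_texts):
    """Return (bench_idx, train_idx) pairs whose normalized texts are identical."""
    # Map each normalized training prompt to every index where it occurs.
    train_index = defaultdict(list)
    for train_idx, text in enumerate(train_texts):
        train_index[normalize(text)].append(train_idx)

    pairs = []
    for bench_idx, text in enumerate(bench_texts):
        for train_idx in train_index.get(normalize(text), []):
            pairs.append((bench_idx, train_idx))
    return pairs


# Toy data for illustration only; not taken from either dataset.
train = ["What is 2 + 2?", "Solve  x^2 = 4  for x.", "Name a prime."]
bench = ["solve x^2 = 4 for x.", "What is 3 + 3?"]
print(find_exact_duplicates(train, bench))  # [(0, 1)]
```

With the real data, the two arguments would be the `prompt` column of the training set and the `problem` column of the benchmark, as used in the reproduction script.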

## Impact

This data leakage raises several concerns:
1. Models fine-tuned on OpenO1-SFT may show artificially inflated performance on MATH-500
2. The validity of MATH-500 as a benchmark for these models is compromised
3. Research conclusions based on these evaluations might be misleading
4. Since MATH-500 is a subset of the MATH dataset, other MATH-related benchmarks might also be affected

While this issue has been confirmed for MATH-500, the status of other widely used benchmarks (such as GSM8K) remains unclear and requires further investigation. My analysis found no evidence of data leakage in newer and more challenging benchmarks such as GPQA-Diamond and AIME-2024.

## How to Reproduce

You can verify this issue using the following code:

```python
from datasets import load_dataset
from datasketch import MinHash, MinHashLSH
import json

trainset = load_dataset("llamafactory/OpenO1-SFT")["train"]
benchmark = load_dataset("qq8933/MATH500")["test"]

def create_minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    
    # Simple text preprocessing
    text = text.lower()
    # Simple tokenization (split by space)
    tokens = text.split()
    
    # Character-level 3-grams
    character_grams = [text[i:i+3] for i in range(len(text)-2)]
    
    # Word-level 2-grams (if there are consecutive words)
    word_grams = [' '.join(tokens[i:i+2]) for i in range(len(tokens)-1)]
    
    # Add all features to MinHash
    for d in character_grams + word_grams:
        m.update(d.encode('utf8'))
    
    return m

# Create LSH index
lsh = MinHashLSH(threshold=0.5, num_perm=128)

# Add training set to LSH index
print("Building training set index...")
train_minhashes = {}
for idx, item in enumerate(trainset):
    minhash = create_minhash(item['prompt'])
    train_minhashes[idx] = minhash
    lsh.insert(f"train_{idx}", minhash)

# Find similar questions
print("\nFinding similar questions...")
similar_pairs = []
for bench_idx, item in enumerate(benchmark):
    bench_minhash = create_minhash(item['problem'])
    
    # Query similar training set questions
    similar_trains = lsh.query(bench_minhash)
    
    for train_id in similar_trains:
        train_idx = int(train_id.split('_')[1])
        # Estimate Jaccard similarity from the two MinHash signatures
        similarity = bench_minhash.jaccard(train_minhashes[train_idx])
        
        similar_pairs.append({
            'benchmark_idx': bench_idx,
            'benchmark_text': item['problem'],
            'train_idx': train_idx,
            'train_text': trainset[train_idx]['prompt'],
            'similarity': similarity
        })

# Sort by similarity
similar_pairs.sort(key=lambda x: x['similarity'], reverse=True)

# Output results
print(f"\nFound {len(similar_pairs)} similar question pairs")
print("\nSimilarity distribution:")
similarities = [pair['similarity'] for pair in similar_pairs]
if similarities:
    print(f"Maximum similarity: {max(similarities):.3f}")
    print(f"Minimum similarity: {min(similarities):.3f}")
    print(f"Average similarity: {sum(similarities)/len(similarities):.3f}")

print("\nTop 5 examples with highest similarity:")
for i, pair in enumerate(similar_pairs[:5]):
    print(f"\nSimilar pair #{i+1} (similarity: {pair['similarity']:.3f}):")
    print(f"Benchmark #{pair['benchmark_idx']}:")
    print(pair['benchmark_text'][:200] + "...")
    print(f"\nTrainset #{pair['train_idx']}:")
    print(pair['train_text'][:200] + "...")

with open('similar_pairs_more_than_0.5.jsonl', 'w') as f:
    for pair in similar_pairs:
        if pair['similarity'] > 0.5:
            f.write(json.dumps(pair) + '\n')
```
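
Note that `MinHash.jaccard` returns an estimate of the Jaccard similarity rather than the exact value, so pairs near the threshold may be worth re-scoring. The sketch below computes the exact Jaccard similarity over the same feature set that `create_minhash` uses (character 3-grams plus word 2-grams). The helper names `shingles` and `exact_jaccard` are hypothetical, and treating two empty feature sets as identical is an assumed convention.

```python
def shingles(text):
    """Build the feature set: character 3-grams plus word 2-grams."""
    text = text.lower()
    tokens = text.split()
    char_grams = {text[i:i + 3] for i in range(len(text) - 2)}
    word_grams = {" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)}
    return char_grams | word_grams


def exact_jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of the two feature sets."""
    set_a, set_b = shingles(a), shingles(b)
    if not set_a and not set_b:
        # Assumed convention: two texts with no features count as identical.
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


# Identical texts score 1.0; texts with no shared features score 0.0.
print(exact_jaccard("find the number of integers", "Find the number of integers"))
print(exact_jaccard("abc", "xyz"))
```

For a pair in `similar_pairs`, calling `exact_jaccard(pair['benchmark_text'], pair['train_text'])` gives a value to compare against the stored `similarity` estimate.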

## Recommendations

When using SFT datasets for model training, it's crucial to carefully select and verify the benchmarks used for evaluation. Data leakage between training data and test sets can lead to misleading performance metrics that don't accurately reflect the model's true capabilities. This case serves as a reminder of the importance of benchmark integrity in machine learning research.

For researchers working with the OpenO1-SFT dataset, I recommend using newer and more challenging benchmarks such as GPQA-Diamond and AIME-2024, where my analysis found no evidence of data leakage.

Once again, I want to express my gratitude to the OpenO1 team for their significant contribution to the open-source AI community. This issue doesn't diminish the value of their work, and I hope this report helps in further improving the dataset's quality.

