Research Needed: Dataset Sample Optimization & Alternative Fine-tuning Techniques
Current Setup
- SFT Stage: 56k samples (math, reading, science, general)
- CoT Stage: 22.5k samples (reasoning-focused)
- Format: standard instruction-response pairs (see the rendering sketch below)
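For reference, here is a minimal sketch of how one of these instruction-response pairs might be rendered into a single training string. The field names ("instruction", "response") and the prompt template are assumptions for illustration, not the repository's actual schema:

```python
# Hypothetical record in the current instruction-response format.
sample = {
    "instruction": "Solve: 12 * 7 = ?",
    "response": "12 * 7 = 84",
}

def to_prompt(record: dict) -> str:
    # Render one record as a single training string (assumed template).
    return (
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### Response:\n{record['response']}"
    )

print(to_prompt(sample))
```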
Research Goals
1. Sample Size Optimization
- Find optimal dataset sizes for TinyLlama-1.1B
- Test scaling from 10k to 100k+ samples (see the subset sketch after this list)
- Determine quality-vs-quantity trade-offs
- Identify the point of diminishing returns
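One way to keep these size comparisons clean is to draw nested subsets from a single shuffled pool, so each larger set strictly contains the smaller ones. A minimal sketch, assuming the pool lives in a JSONL file; the file names and size list are placeholders:

```python
import json
import random

SIZES = [10_000, 25_000, 50_000, 100_000]  # candidate training-set sizes

# Load the full sample pool; the file name is a placeholder.
with open("sft_pool.jsonl") as f:
    pool = [json.loads(line) for line in f]

random.seed(42)      # fixed seed so all subset sizes share one ordering
random.shuffle(pool)

for n in SIZES:
    if n > len(pool):
        break        # skip sizes larger than the available pool
    with open(f"sft_{n // 1000}k.jsonl", "w") as out:
        for record in pool[:n]:
            out.write(json.dumps(record) + "\n")
```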
2. Alternative Training Formats
Beyond basic instruction-response:
- Conversational: Multi-turn dialogues, ChatML format (conversion sketch after this list)
- Completion-based: Raw text, document continuation
- Task-specific: Q&A pairs, code generation, summarization
- Advanced: Few-shot examples, chain-of-thought variations
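As a concrete example of the conversational option, here is a sketch that wraps an instruction-response pair in ChatML-style turns. The system prompt is an assumption, and whether the <|im_start|>/<|im_end|> markers are registered as special tokens in the tokenizer we use would need to be verified:

```python
def to_chatml(instruction: str, response: str,
              system: str = "You are a helpful assistant.") -> str:
    # Wrap one pair in ChatML-style role-delimited turns.
    turns = [("system", system), ("user", instruction), ("assistant", response)]
    return "".join(
        f"<|im_start|>{role}\n{content}<|im_end|>\n" for role, content in turns
    )

print(to_chatml("Name three noble gases.", "Helium, neon, and argon."))
```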
What We Need
- Performance analysis across different sample sizes
- Comparison of training formats on same content
- Benchmark results (GSM8K, ARC, HellaSwag; see the evaluation sketch after this list)
- Code and configurations for reproducible experiments
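For the benchmark runs, a sketch using EleutherAI's lm-evaluation-harness Python API (0.4.x-style `simple_evaluate`). The checkpoint path is a placeholder, and the task names should be checked against the installed harness version:

```python
import lm_eval

# Evaluate a fine-tuned checkpoint on the three target benchmarks.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./checkpoints/tinyllama-sft",  # placeholder path
    tasks=["gsm8k", "arc_challenge", "hellaswag"],
    num_fewshot=0,
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```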
Deliverables
- Research report with recommendations
- Preprocessing scripts for new formats
- Training configurations and evaluation tools (starter config sketched after this list)
- Documentation
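As a starting point for a reproducible training configuration, a sketch using TRL's SFTTrainer with TinyLlama-1.1B. The hyperparameters and dataset path are placeholders, the dataset is assumed to carry a pre-rendered "text" field, and argument names may differ across TRL versions:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Assumes each JSONL record has a pre-rendered "text" field (placeholder path).
dataset = load_dataset("json", data_files="sft_56k.jsonl", split="train")

config = SFTConfig(
    output_dir="./checkpoints/tinyllama-sft",
    num_train_epochs=2,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    logging_steps=50,
    seed=42,                      # fix the seed for reproducible runs
)

trainer = SFTTrainer(
    model="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",  # base checkpoint
    args=config,
    train_dataset=dataset,
)
trainer.train()
```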
Focus on practical improvements to training efficiency and model performance.