Synthetic Data Kit

A powerful CLI tool for generating high-quality synthetic datasets to fine-tune Large Language Models (LLMs). Easily create reasoning traces, QA pairs, and convert them into fine-tuning formats.

Overview

Fine-tuning LLMs is straightforward with mature tools available for the Llama model family. However, preparing your data in the right format can be challenging. Synthetic Data Kit simplifies this by:

Using LLMs (Ollama, OpenAI, or custom API endpoints) to generate examples
Providing a modular 4-command workflow
Converting existing files into fine-tuning-friendly formats
Supporting various post-training formats

The toolkit follows a simple CLI structure with 4 main commands:

ingest: Parse various file formats
create: Generate fine-tuning data (QA pairs, CoT reasoning, summaries)
curate: Filter high-quality examples using LLM-as-a-judge
save-as: Convert to your preferred fine-tuning format

Key Features

Multi-Provider Support: Ollama, OpenAI, vLLM, and custom API endpoints
File Format Support: PDF, HTML, DOCX, PPTX, TXT, YouTube transcripts
Data Types: QA pairs, Chain-of-Thought reasoning, Multimodal QA
Batch Processing: Handle entire directories of files
Intelligent Chunking: Automatic handling of large documents
Quality Curation: LLM-powered filtering for high-quality data
Multiple Output Formats: Alpaca, ChatML, JSONL, OpenAI Fine-Tuning
Preview Mode: See what files will be processed before running
Flexible Configuration: YAML-based config with CLI overrides
Language Control: Respond in English (default) or match the input document language

Installation

From PyPI (Recommended)

# Create a virtual environment
conda create -n synthetic-data python=3.10
conda activate synthetic-data

# Install the package
pip install synthetic-data-kit

From Source

# Clone the repository
git clone https://github.com/meta-llama/synthetic-data-kit.git
cd synthetic-data-kit

# Install in development mode
pip install -e .

Dependencies

Install additional dependencies based on your needs:

# For PDF processing
pip install pdfminer.six

# For HTML processing
pip install beautifulsoup4

# For YouTube transcripts
pip install pytubefix youtube-transcript-api

# For Office documents
pip install python-docx python-pptx

# For enhanced JSON parsing
pip install json5

Quick Start

Set up your environment:

# Create data directories
mkdir -p data/{input,parsed,generated,curated,final}

Choose your LLM provider:

Ollama (Local, recommended):

curl -fsSL https://ollama.ai/install.sh | sh
ollama pull llama3.2:3b

OpenAI:

export OPENAI_API_KEY="your-api-key-here"

Or create a .env file (auto-loaded):

echo 'OPENAI_API_KEY=your-api-key-here' > .env

vLLM:

pip install vllm
vllm serve meta-llama/Llama-3.3-70B-Instruct --port 8000

Process your first document:

# Check system
synthetic-data-kit system-check --provider ollama

# Ingest and process
synthetic-data-kit ingest research_paper.pdf
synthetic-data-kit create data/parsed/research_paper.txt --type qa
synthetic-data-kit curate data/generated/research_paper_qa_pairs.json
synthetic-data-kit save-as data/curated/research_paper_cleaned.json --format alpaca

LLM Providers & Models

Synthetic Data Kit supports multiple LLM providers with flexible model selection:

Supported Providers

Ollama (Local, recommended for privacy)
OpenAI (Cloud, high quality)
vLLM (Local inference server)
API Endpoint (Custom/compatible APIs)

Provider Setup

Ollama (Recommended)

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull recommended models
ollama pull llama3.2:3b      # Fast, good quality
ollama pull llama3.1:8b      # Better quality, slower
ollama pull mistral:7b       # Alternative option

# Use in toolkit
synthetic-data-kit create document.txt --provider ollama --model llama3.2:3b

OpenAI

# Set API key
export OPENAI_API_KEY="your-api-key-here"

# Use in toolkit
synthetic-data-kit create document.txt --provider openai --model gpt-4o
synthetic-data-kit create document.txt --provider openai --model gpt-3.5-turbo

vLLM

# Install vLLM
pip install vllm

# Start server
vllm serve meta-llama/Llama-3.3-70B-Instruct --port 8000

# Use in toolkit
synthetic-data-kit create document.txt --provider vllm --model meta-llama/Llama-3.3-70B-Instruct

Custom API Endpoint

# Use any OpenAI-compatible API
synthetic-data-kit create document.txt --provider api-endpoint --model your-model-name

Model Recommendations

Use Case	Ollama Model	OpenAI Model	Performance
Quick testing	`llama3.2:3b`	`gpt-3.5-turbo`	Fast, good quality
High quality	`llama3.1:8b`	`gpt-4o`	Slower, best quality
Local only	`mistral:7b`	N/A	Good balance

Usage

System Check

Verify your LLM provider is working:

# Check Ollama
synthetic-data-kit system-check --provider ollama

# Check OpenAI
synthetic-data-kit system-check --provider openai

# Check vLLM
synthetic-data-kit system-check --provider vllm

Ingest

Parse documents into a processable format:

# Single file
synthetic-data-kit ingest document.pdf

# Directory (batch processing)
synthetic-data-kit ingest ./documents/

# YouTube video
synthetic-data-kit ingest "https://www.youtube.com/watch?v=VIDEO_ID"

# Multimodal (extract text and images)
synthetic-data-kit ingest document.pdf --multimodal

# Preview mode
synthetic-data-kit ingest ./documents/ --preview

# PDF page range (inclusive, 1-based)
synthetic-data-kit ingest report.pdf --page-range "[10,25]"
synthetic-data-kit ingest report.pdf --page-range 10-25

Supported formats: PDF, HTML, DOCX, PPTX, TXT, YouTube URLs

Create

Generate synthetic data from parsed documents:

# QA pairs
synthetic-data-kit create data/parsed/document.txt --type qa --num-pairs 30

# Chain-of-Thought reasoning
synthetic-data-kit create data/parsed/document.txt --type cot --num-pairs 20

# Multimodal QA (requires multimodal ingest)
synthetic-data-kit create data/parsed/document.lance --type multimodal-qa

# Directory processing
synthetic-data-kit create ./data/parsed/ --type qa --num-pairs 50

# Custom chunking
synthetic-data-kit create document.txt --type qa --chunk-size 2000 --chunk-overlap 100

# Difficulty control (QA, CoT, Multimodal)
synthetic-data-kit create document.txt --type qa --difficulty advanced
synthetic-data-kit create document.txt --type cot --difficulty medium
synthetic-data-kit create data/parsed/document.lance --type multimodal-qa --difficulty easy

# Using different providers
synthetic-data-kit create document.txt --type qa --provider ollama --model llama3.2:3b
synthetic-data-kit create document.txt --type qa --provider openai --model gpt-4o
synthetic-data-kit create document.txt --type qa --provider vllm --model meta-llama/Llama-3.3-70B-Instruct
synthetic-data-kit create document.txt --type qa --provider api-endpoint --model your-custom-model

# Language control
# Default is English; use --language source to match the input text language (e.g., Arabic)
synthetic-data-kit create document.txt --type qa --language source
synthetic-data-kit create document.txt --type cot --language source

# PDF page range (inclusive, 1-based)
# You can point create directly at a PDF and limit pages
synthetic-data-kit create document.pdf --type cot --language source --page-range "[100,115]"
synthetic-data-kit create document.pdf --type qa --page-range 5-12

Options:

--type: qa, cot, multimodal-qa
--num-pairs: Number of pairs to generate
--chunk-size: Text chunk size (default: 4000)
--chunk-overlap: Overlap between chunks (default: 200)
--difficulty: Question difficulty for generation (easy, medium, advanced) for qa, cot, and multimodal-qa
--language: Output language: english (default) or source to match the input text language
--provider: LLM provider (ollama, openai, vllm, api-endpoint)
--model: Specific model to use (provider-dependent)
--verbose: Show detailed progress
--page-range / --page_range: For PDFs only, inclusive 1-based pages. Accepts "[start,end]" or "start-end".

Curate

Filter generated data for quality using LLM-as-a-judge:

# Single file
synthetic-data-kit curate data/generated/qa_pairs.json --threshold 8.0

# Directory processing
synthetic-data-kit curate ./data/generated/ --threshold 7.5

# Custom batch size
synthetic-data-kit curate qa_pairs.json --batch-size 16

Options:

--threshold: Quality threshold (0-10, default: 7.0)
--batch-size: Processing batch size (default: 8)
--provider: LLM provider (ollama, openai, vllm, api-endpoint)
--model: Specific model to use (provider-dependent)

Save As

Convert curated data to fine-tuning formats:

# Alpaca format
synthetic-data-kit save-as data/curated/cleaned.json --format alpaca

# ChatML format
synthetic-data-kit save-as data/curated/cleaned.json --format chatml

# OpenAI fine-tuning format
synthetic-data-kit save-as data/curated/cleaned.json --format ft

# Hugging Face dataset
synthetic-data-kit save-as data/curated/cleaned.json --format ft --storage hf

# Directory processing
synthetic-data-kit save-as ./data/curated/ --format alpaca

Supported formats:

alpaca: Alpaca instruction format
chatml: ChatML conversation format
ft: OpenAI fine-tuning format
jsonl: JSON Lines format

Storage options:

json: JSON file (default)
hf: Hugging Face dataset

Configuration

The toolkit uses YAML configuration files. Default: configs/config.yaml

Basic Configuration

# LLM Provider Configuration
llm:
  provider: "ollama"  # or "openai", "api-endpoint", "vllm"

# Provider-specific settings
ollama:
  api_base: "http://localhost:11434"
  model: "llama3.2:3b"
  sleep_time: 0.1

openai:
  api_base: "https://api.openai.com/v1"
  api_key: "sk-your-openai-api-key"
  model: "gpt-4o"
  sleep_time: 0.5

# Generation settings
generation:
  temperature: 0.7
  chunk_size: 4000
  num_pairs: 25
  max_context_length: 8000

# Curation settings
curate:
  threshold: 7.0
  batch_size: 8

Custom Configuration

Create a custom config file and use it with the -c flag:

synthetic-data-kit -c my_config.yaml ingest document.pdf

Custom Prompt Templates

Environment Variables (.env)

The CLI automatically loads environment variables from a .env file at the project root if present. This is useful for managing provider credentials without exporting them in your shell.

Examples:

# .env
OPENAI_API_KEY=sk-your-openai-api-key
API_ENDPOINT_KEY=your-custom-endpoint-key

You can still export variables in your shell; .env loading is non-fatal if missing.

Override default prompts in your config:

prompts:
  qa_generation: |
    You are creating question-answer pairs for fine-tuning a {domain} assistant.
    Focus on {focus_areas}.
    
    Create {num_pairs} high-quality question-answer pairs based ONLY on this text.
    
    Return ONLY valid JSON formatted as:
    [
      {{
        "question": "Detailed question?",
        "answer": "Precise answer."
      }},
      ...
    ]
    
    Text:
    ---
    {text}
    ---

Examples

Single Document Processing

# Complete workflow for one PDF
synthetic-data-kit ingest research_paper.pdf
synthetic-data-kit create data/parsed/research_paper.txt --type qa -n 30
synthetic-data-kit curate data/generated/research_paper_qa_pairs.json -t 8.5
synthetic-data-kit save-as data/curated/research_paper_cleaned.json --format ft

Batch Directory Processing

# Process all documents in a directory
synthetic-data-kit ingest ./research_papers/
synthetic-data-kit create ./data/parsed/ --type qa -n 50
synthetic-data-kit curate ./data/generated/ --threshold 8.0
synthetic-data-kit save-as ./data/curated/ --format alpaca --storage hf

Chain-of-Thought Reasoning

# Generate reasoning traces
synthetic-data-kit ingest technical_doc.pdf
synthetic-data-kit create data/parsed/technical_doc.txt --type cot --num-pairs 20
synthetic-data-kit curate data/generated/technical_doc_cot.json --threshold 9.0
synthetic-data-kit save-as data/curated/technical_doc_cot.json --format chatml

Multimodal Processing

# Extract text and images
synthetic-data-kit ingest report.pdf --multimodal
synthetic-data-kit create data/parsed/report.lance --type multimodal-qa

Preview Mode

# See what will be processed
synthetic-data-kit ingest ./documents/ --preview
synthetic-data-kit create ./data/parsed/ --preview --verbose

Document Processing & Chunking

How Chunking Works

The toolkit automatically handles documents of any size:

Small documents (< 8000 characters): Single API call
Large documents (≥ 8000 characters): Split into chunks with overlap

Chunking Parameters

Parameter	Default	Description
`--chunk-size`	4000	Characters per chunk
`--chunk-overlap`	200	Overlap between chunks
`--verbose`	false	Show chunking progress

Custom Chunking

# Single file
synthetic-data-kit create large_doc.txt --type qa --chunk-size 2000 --chunk-overlap 100 --verbose

# Directory
synthetic-data-kit create ./data/parsed/ --type cot --chunk-size 6000 --verbose

Understanding Output

With --verbose, you'll see:

Generating QA pairs...
Document split into 8 chunks
Processing 8 chunks to generate QA pairs...
  Generated 3 pairs from chunk 1 (total: 3/20)
  Generated 2 pairs from chunk 2 (total: 5/20)
  ...
Generated 20 QA pairs total (requested: 20)

Advanced Usage

Mental Model

graph LR
    SDK[synthetic-data-kit] --> SystemCheck[system-check]
    SDK --> Ingest[ingest]
    SDK --> Create[create]
    SDK --> Curate[curate]
    SDK --> SaveAs[save-as]
    
    Ingest --> PDFFile[PDF]
    Ingest --> HTMLFile[HTML]
    Ingest --> YouTubeURL[YouTube]
    Ingest --> DocxFile[DOCX]
    Ingest --> Multimodal[Multimodal]

    Create --> Ollama[Ollama]
    Create --> OpenAI[OpenAI]
    Create --> APIEndpoint[API Endpoint]
    Create --> VLLM[vLLM]
    
    Create --> CoT[CoT]
    Create --> QA[QA Pairs]
    Create --> Summary[Summary]
    Create --> MultimodalQA[Multimodal QA]
    
    Curate --> Filter[Filter by Quality]
    
    SaveAs --> JSONL[JSONL]
    SaveAs --> Alpaca[Alpaca]
    SaveAs --> FT[Fine-Tuning]
    SaveAs --> ChatML[ChatML]

Custom Domain Adaptation

Adapt prompts for specific domains:

prompts:
  qa_generation: |
    You are creating question-answer pairs for fine-tuning a legal assistant.
    Focus on technical legal concepts, precedents, and statutory interpretation.
    
    Create {num_pairs} high-quality question-answer pairs based ONLY on this text.
    
    Return ONLY valid JSON formatted as:
    [
      {{
        "question": "Detailed legal question?",
        "answer": "Precise legal answer."
      }},
      ...
    ]
    
    Text:
    ---
    {text}
    ---

Legacy Directory Structure

The toolkit still supports the legacy structure:

mkdir -p data/{pdf,html,youtube,docx,ppt,txt,output,generated,cleaned,final}

Troubleshooting

Common Issues

Ollama Issues

Installation: curl -fsSL https://ollama.ai/install.sh | sh
Model: ollama pull llama3.2:3b
Server: ollama list
API Check: curl http://localhost:11434/api/tags

OpenAI Issues

API Key: export OPENAI_API_KEY="your-key"
Validation: synthetic-data-kit system-check --provider openai
Usage: Check at https://platform.openai.com/usage

vLLM Issues

Installation: pip install vllm
Server: vllm serve <model_name> --port 8000
Connection: synthetic-data-kit system-check --provider vllm

Memory Issues

Use smaller models (e.g., llama3.2:3b)
Reduce batch size in config
For vLLM: vllm serve <model> --gpu-memory-utilization 0.85

JSON Parsing Issues

Enable verbose: synthetic-data-kit curate file.json -v
Reduce batch size
Ensure LLM supports JSON output
Install: pip install json5

Parser Errors

Install required dependencies:

PDF: pip install pdfminer.six
HTML: pip install beautifulsoup4
YouTube: pip install pytubefix youtube-transcript-api
DOCX: pip install python-docx
PPTX: pip install python-pptx

Testing and Demos

Running Tests

# Provider tests
python tests/unit/test_standalone.py
python tests/unit/test_providers.py

# Functional tests for language option
pytest tests/functional/test_language_option.py -q

Demo Scripts

# Run provider demo
python use-cases/demo_providers.py

This demonstrates:

Ollama and OpenAI client initialization
API calls and configurations
Pipeline integration
Configuration setup

For more examples and use cases, check out:

Name		Name	Last commit message	Last commit date
Latest commit History 102 Commits
.github		.github
configs		configs
synthetic_data_kit		synthetic_data_kit
tests		tests
use-cases		use-cases
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
DOCS.md		DOCS.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

Synthetic Data Kit

Table of Contents

Overview

Key Features

Installation

From PyPI (Recommended)

From Source

Dependencies

Quick Start

LLM Providers & Models

Supported Providers

Provider Setup

Ollama (Recommended)

OpenAI

vLLM

Custom API Endpoint

Model Recommendations

Usage

System Check

Ingest

Create

Curate

Save As

Configuration

Basic Configuration

Custom Configuration

Custom Prompt Templates

Environment Variables (.env)

Examples

Single Document Processing

Batch Directory Processing

Chain-of-Thought Reasoning

Multimodal Processing

Preview Mode

Document Processing & Chunking

How Chunking Works

Chunking Parameters

Custom Chunking

Understanding Output

Advanced Usage

Mental Model

Custom Domain Adaptation

Legacy Directory Structure

Troubleshooting

Common Issues

Ollama Issues

OpenAI Issues

vLLM Issues

Memory Issues

JSON Parsing Issues

Parser Errors

Testing and Demos

Running Tests

Demo Scripts

About

Resources

License

Code of conduct

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages