A PostgreSQL-backed, supervisor-managed video processing pipeline for generating large-scale computer vision datasets from YouTube videos.
- 🔍 YouTube Search - Find videos by keywords and hashtags
- 📥 Smart Downloads - Rate-limited downloading with hashtag blocklists
- 🎬 Frame Extraction - Configurable sampling strategies (interval, time, keyframe)
- 🎯 ML Filtering - SigLIP2-based image-text similarity filtering
- 🔄 Deduplication - DINOv3/FAISS-based cross-video deduplication
- 🎯 Object Detection - Open-set detection (GroundingDINO, OWLv2)
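
As a rough illustration of the interval and time sampling strategies listed above, the sketch below picks which frame indices to keep. The function names and parameters (`interval_indices`, `time_indices`, `every_n`, `every_seconds`) are assumptions made for this example and are not taken from the project's code. Keyframe sampling is left out because it depends on the video decoder.

```python
from typing import List


def interval_indices(total_frames: int, every_n: int) -> List[int]:
    """Keep every `every_n`-th frame, starting at frame 0 (hypothetical helper)."""
    if every_n <= 0:
        raise ValueError("every_n must be positive")
    return list(range(0, total_frames, every_n))


def time_indices(total_frames: int, fps: float, every_seconds: float) -> List[int]:
    """Keep roughly one frame per `every_seconds` of video (hypothetical helper)."""
    if fps <= 0 or every_seconds <= 0:
        raise ValueError("fps and every_seconds must be positive")
    # Convert the time step to a whole number of frames, never less than 1.
    step = max(1, round(fps * every_seconds))
    return list(range(0, total_frames, step))


if __name__ == "__main__":
    print(interval_indices(10, 3))
    print(time_indices(100, 25.0, 1.0))
```

Both helpers return indices only, so the same selection logic could be reused with any decoder.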
flowchart LR
subgraph Central["Central Pipeline"]
D[Download] --> E[Extract]
end
subgraph Project["Per-Project Pipeline"]
F[Filter] --> DU[Cross-Dedup] --> DT[Detect]
end
E --> F
The pipeline uses:
- PostgreSQL for state management with row-level locking
- Supervisor for worker process management
- Heartbeat-based locking for concurrent safety
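
The combination of row-level locking and heartbeats can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the table and column names in `CLAIM_SQL` (`videos`, `status`, `locked_by`, `heartbeat_at`) and the `Job` / `claim_next` helpers are hypothetical. The SQL shows one common PostgreSQL pattern (`FOR UPDATE SKIP LOCKED`), and the in-memory functions mirror the same rule so it can be run without a database: a row may be claimed if it is unlocked or if its owner's heartbeat is older than the timeout.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Hypothetical claim query; the schema is assumed for illustration only.
# SKIP LOCKED lets concurrent workers pass over rows another transaction holds.
CLAIM_SQL = """
UPDATE videos
SET locked_by = %(worker)s, heartbeat_at = now()
WHERE id = (
    SELECT id FROM videos
    WHERE status = 'pending'
      AND (locked_by IS NULL
           OR heartbeat_at < now() - %(timeout)s * interval '1 second')
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id;
"""


@dataclass
class Job:
    id: int
    locked_by: Optional[str] = None
    heartbeat_at: Optional[datetime] = None


def is_claimable(job: Job, now: datetime, timeout: timedelta) -> bool:
    """A job is claimable if unlocked, or if its owner's heartbeat is stale."""
    if job.locked_by is None:
        return True
    return job.heartbeat_at is None or now - job.heartbeat_at > timeout


def claim_next(jobs: List[Job], worker: str, now: datetime,
               timeout: timedelta) -> Optional[Job]:
    """In-memory stand-in for CLAIM_SQL: claim the lowest-id claimable job."""
    for job in sorted(jobs, key=lambda j: j.id):
        if is_claimable(job, now, timeout):
            job.locked_by = worker
            job.heartbeat_at = now
            return job
    return None
```

In a design like this, a worker that crashes simply stops refreshing `heartbeat_at`, and its row becomes claimable again once the timeout elapses.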
# Create a new .venv and install the project in editable mode
uv sync
# Or install in editable mode into an existing virtual environment
uv pip install -e .
# Initialize database
data-miner init-db
# Add videos and run pipeline
data-miner populate --config config.yaml
data-miner workers setup --config config.yaml
data-miner workers start

data_miner/
├── cli.py # CLI commands
├── config/ # Configuration system
├── db/ # Database layer
├── workers/ # Supervisor-managed workers
├── modules/ # Core processing logic
├── models/ # ML model wrappers
└── utils/ # Utilities
Full documentation is available at mvpavan.github.io/data-miner
| User Guide | Developer Docs |
|---|---|
| Installation | Architecture Overview |
| Configuration | Database Models |
| CLI Reference | Worker System |
| Quickstart | Contributing |
This project is licensed under the MIT License - see the LICENSE file for details.