Data Miner

A PostgreSQL-backed, supervisor-managed video processing pipeline for generating large-scale computer vision datasets from YouTube videos.

✨ Features

🔍 YouTube Search - Find videos by keywords and hashtags
📥 Smart Downloads - Rate-limited downloading with hashtag blocklists
🎬 Frame Extraction - Configurable sampling strategies (interval, time, keyframe)
🎯 ML Filtering - SigLIP2-based image-text similarity filtering
🔄 Deduplication - DINOv3/FAISS-based cross-video deduplication
🎯 Object Detection - Open-set detection (GroundingDINO, OWLv2)
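
The filtering and deduplication features above both reduce to comparing embedding vectors. The following is a minimal sketch of that idea in plain Python, not the project's implementation: the function names, the thresholds (`min_similarity`, `max_similarity`), the frame IDs, and the toy 3-dimensional vectors are all hypothetical. In the real pipeline the embeddings would come from models such as SigLIP2 or DINOv3, and a FAISS index would replace the brute-force comparison used here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_frames(frames, text_embedding, min_similarity=0.3):
    """Keep frames whose embedding is close enough to the text prompt embedding."""
    return [
        f for f in frames
        if cosine(f["embedding"], text_embedding) >= min_similarity
    ]


def deduplicate(frames, max_similarity=0.95):
    """Greedy dedup: keep a frame only if no already-kept frame is too similar."""
    kept = []
    for frame in frames:
        if all(cosine(frame["embedding"], k["embedding"]) < max_similarity for k in kept):
            kept.append(frame)
    return kept


# Toy data (hypothetical): real embeddings have hundreds of dimensions.
frames = [
    {"id": "video_a/0001", "embedding": [1.0, 0.0, 0.0]},
    {"id": "video_b/0042", "embedding": [0.99, 0.05, 0.0]},  # near-duplicate from another video
    {"id": "video_b/0100", "embedding": [0.0, 1.0, 0.0]},
    {"id": "video_c/0007", "embedding": [0.0, 0.0, 1.0]},    # orthogonal to the prompt
]
prompt = [0.7, 0.7, 0.0]

relevant = filter_frames(frames, prompt)
unique = deduplicate(relevant)
print([f["id"] for f in unique])  # ['video_a/0001', 'video_b/0100']
```

The greedy pass is quadratic in the number of frames, which is why an approximate nearest-neighbour index is the usual choice at dataset scale.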

🏗️ Architecture

flowchart LR
    subgraph Central["Central Pipeline"]
        D[Download] --> E[Extract]
    end
    
    subgraph Project["Per-Project Pipeline"]
        F[Filter] --> DU[Cross-Dedup] --> DT[Detect]
    end
    
    E --> F
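
One way to read the flowchart is that the central stages run once per video while the per-project stages run separately for each project. The sketch below enumerates work items under that assumption; the stage identifiers, the `plan` function, and the example project names are hypothetical and are not taken from the codebase.

```python
CENTRAL_STAGES = ["download", "extract"]
PROJECT_STAGES = ["filter", "cross_dedup", "detect"]


def plan(video_ids, projects):
    """List work items: central stages per video, then project stages per project."""
    items = []
    for video_id in video_ids:
        for stage in CENTRAL_STAGES:
            items.append(("central", stage, video_id))
    for project in projects:
        for stage in PROJECT_STAGES:
            # Project stages consume the frames extracted by the central pipeline.
            items.append((project, stage, None))
    return items


work = plan(["vid1", "vid2"], ["traffic", "wildlife"])
print(len(work))  # 10
```

Under this reading, adding a second project costs three more project-level stages but no additional downloads or extractions.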

The pipeline uses:

PostgreSQL for state management with row-level locking
Supervisor for worker process management
Heartbeat-based locking for concurrent safety
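
To illustrate how heartbeat-based locking can keep concurrent workers from processing the same item, here is a small in-memory sketch. It is an assumption about the general technique, not the project's schema or code: `TaskTable`, the `owner` and `heartbeat` fields, and the 60-second `HEARTBEAT_TIMEOUT` are hypothetical. In PostgreSQL, the claim step would typically run inside a transaction with a row lock (for example `SELECT ... FOR UPDATE SKIP LOCKED`) so that two workers cannot claim the same row.

```python
HEARTBEAT_TIMEOUT = 60.0  # seconds; hypothetical value


class TaskTable:
    """In-memory stand-in for a task table with owner and heartbeat columns."""

    def __init__(self, task_ids):
        self.rows = {t: {"owner": None, "heartbeat": None} for t in task_ids}

    def claim(self, worker, now):
        """Claim the first task that is unowned or whose heartbeat has gone stale."""
        for task_id, row in self.rows.items():
            stale = (
                row["heartbeat"] is not None
                and now - row["heartbeat"] > HEARTBEAT_TIMEOUT
            )
            if row["owner"] is None or stale:
                row["owner"], row["heartbeat"] = worker, now
                return task_id
        return None

    def heartbeat(self, worker, task_id, now):
        """Refresh the lock; returns False if another worker has taken the task over."""
        row = self.rows[task_id]
        if row["owner"] != worker:
            return False
        row["heartbeat"] = now
        return True


table = TaskTable(["video-1"])
first = table.claim("worker-a", now=0.0)               # worker-a gets the task
blocked = table.claim("worker-b", now=30.0)            # lock still fresh
refreshed = table.heartbeat("worker-a", "video-1", now=50.0)
still_blocked = table.claim("worker-b", now=100.0)     # only 50 s since last heartbeat
taken_over = table.claim("worker-b", now=200.0)        # worker-a went silent, lock is stale
lost = table.heartbeat("worker-a", "video-1", now=210.0)
print(first, blocked, still_blocked, taken_over, lost)  # video-1 None None video-1 False
```

Timestamps are passed in explicitly so the example is deterministic; a real worker would use the database clock so that all workers agree on what counts as stale.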

🚀 Quick Start

# Create a new .venv and install in editable mode
uv sync

# Or install in editable mode into an existing virtual environment
uv pip install -e .

# Initialize database
data-miner init-db

# Add videos and run pipeline
data-miner populate --config config.yaml
data-miner workers setup --config config.yaml
data-miner workers start

📁 Project Structure

data_miner/
├── cli.py              # CLI commands
├── config/             # Configuration system
├── db/                 # Database layer
├── workers/            # Supervisor-managed workers
├── modules/            # Core processing logic
├── models/             # ML model wrappers
└── utils/              # Utilities

📚 Documentation

Full documentation is available at mvpavan.github.io/data-miner

User Guide	Developer Docs
Installation	Architecture Overview
Configuration	Database Models
CLI Reference	Worker System
Quickstart	Contributing

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
